I use Apache Spark extensively for my independent projects, which led me to create a project that shows how to do data wrangling with Apache Spark.

We used Apache Spark 2.0.2 on Databricks cloud for this project. Instead of using publicly available data, we generated fake person records with the Python package ‘fake-factory’.
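As a rough sketch of that generation step (the record count and age range here are illustrative, assuming the faker module that ‘fake-factory’ provides), the records might be built like this:

from faker import Faker

fake = Faker()
# build (last_name, first_name, occupation, company, age) tuples
data = [(fake.last_name(), fake.first_name(), fake.job(),
         fake.company(), fake.random_int(min=1, max=90))
        for _ in range(10000)]

Using this data, we created a Spark DataFrame: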


# create a DataFrame from the generated records and inspect its schema
dataDF = sqlContext.createDataFrame(
    data, ('last_name', 'first_name', 'occupation', 'company', 'age'))
dataDF.printSchema()

Once we have a DataFrame, we can apply many transformations and actions, depending on what we need. Even when we are just exploring the data, the Spark DataFrame API is very helpful and easy to pick up, especially if you have worked with pandas before.


# select first and last name and create a new DataFrame.
# select is a transformation, so we get a new DataFrame; DataFrames are immutable.
subDF = dataDF.select('last_name', 'first_name')
# filter operation: keep only records with age < 13
filteredDF = dataDF.filter(dataDF.age < 13)
filteredDF.show(truncate=False)
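
Transformations can also be chained, and nothing is computed until an action such as show() or count() triggers execution. As an illustrative sketch (the grouping on 'occupation' and the age threshold are just examples against the schema above), we might count the most common occupations among adults:

from pyspark.sql.functions import col

# chain transformations; execution is deferred until show() is called
adultsByJob = (dataDF
               .filter(col('age') >= 18)
               .groupBy('occupation')
               .count()
               .orderBy('count', ascending=False))
adultsByJob.show(5)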

We can use Python lambda functions as well, but I would like to take this opportunity to show how we can use a Spark user-defined function (UDF). A UDF is a special wrapper around a Python function that allows it to be used in DataFrame queries.

# Python lambda functions and UDFs
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

less_13 = udf(lambda s: s < 13, BooleanType())
lambdaDF = dataDF.filter(less_13(dataDF.age))
lambdaDF.show()
lambdaDF.count()
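
The same UDF is not limited to filter(); as a small sketch (the column name is_child is just an example), it can also derive a new Boolean column. For a simple comparison like this, though, the built-in expression dataDF.age < 13 is usually preferable, since it avoids shipping every row through a Python worker:

# use the UDF to add a Boolean column instead of filtering
dataDF.withColumn('is_child', less_13(dataDF.age)).show(5)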

The project is hosted in a GitHub repository, and I hope it shows how you can leverage Apache Spark for your own data wrangling projects.