I'm a big fan of Apache Spark and use it for all of my independent projects. Recently I put together a small project to showcase some basic data wrangling with Spark DataFrames.
For this project, we used Apache Spark 2.0.2 on the Databricks cloud. Instead of using publicly available data, we generated fake person records with the Python package 'fake-factory' (now published as Faker). With this data, we created a Spark DataFrame like so:
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'occupation', 'company', 'age'))
dataDF.printSchema()
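The `data` passed to `createDataFrame` above is just a list of tuples, one per person, in the same order as the column names. If you want to try this without installing fake-factory, here is a rough stand-in that uses only the standard library. It is not the generator the project used: the name pools, the `fake_person` and `make_people` helpers, and the age range of 1 to 90 are all made up for illustration.

```python
import random

# Hypothetical value pools; the original project drew these from fake-factory.
LAST_NAMES = ["Nguyen", "Garcia", "Okafor", "Larsen", "Patel"]
FIRST_NAMES = ["Ana", "Ben", "Chidi", "Dara", "Eli"]
OCCUPATIONS = ["Engineer", "Nurse", "Teacher", "Chef", "Pilot"]
COMPANIES = ["Acme Corp", "Globex", "Initech", "Umbrella Ltd"]


def fake_person(rng):
    """Return one (last_name, first_name, occupation, company, age) tuple."""
    return (
        rng.choice(LAST_NAMES),
        rng.choice(FIRST_NAMES),
        rng.choice(OCCUPATIONS),
        rng.choice(COMPANIES),
        rng.randint(1, 90),  # assumed age range, inclusive on both ends
    )


def make_people(n, seed=0):
    """Build n fake person records; a fixed seed makes the data repeatable."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n)]


# Same shape as the `data` variable used with createDataFrame.
data = make_people(1000)
```

Seeding the generator is a small design choice that pays off later: the same records come back on every run, so counts and filters are easy to compare between experiments.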
Once we had a DataFrame, we could apply transformations and actions to it as needed. Even for simply exploring the data, the Spark DataFrame API was helpful and easy to pick up, especially if you're already familiar with pandas.
For example, we could select the first and last names into a new DataFrame, or filter the rows by age:
# select the last_name and first_name columns
# select is a transformation: DataFrames are immutable, so it returns a new DataFrame
subDF = dataDF.select('last_name', 'first_name')
# filter is also a transformation: keep only the rows where age is under 13
filteredDF = dataDF.filter(dataDF.age < 13)
filteredDF.show(truncate=False)
The same filter can also be written with a Python lambda, which gives me a chance to show you a Spark User Defined Function (UDF). A UDF wraps an ordinary Python function so that it can be called on columns in a DataFrame query. Check it out:
# wrap a Python lambda in a UDF that returns a boolean
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
less_13 = udf(lambda s: s < 13, BooleanType())
lambdaDF = dataDF.filter(less_13(dataDF.age))
lambdaDF.show()
lambdaDF.count()
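One thing to watch with a UDF like this: Spark hands a null column value to a Python UDF as `None`, and in Python 3 the comparison `None < 13` raises a `TypeError`. Our generated data always has an age, so the lambda is fine here, but for messier data a None-safe version of the predicate is worth having. Here is a minimal sketch in plain Python so it can be checked without a cluster; the function name `is_under_13` is my own and not part of the project.

```python
def is_under_13(age):
    """Return True/False for a known age, or None when the age is missing."""
    if age is None:
        # Pass the missing value through instead of comparing None with an int.
        return None
    return age < 13


# Quick check on a handful of values, including a missing one.
ages = [5, 12, 13, 40, None]
flags = [is_under_13(a) for a in ages]
print(flags)  # [True, True, False, False, None]
```

In a Spark session this function could be wrapped the same way as the lambda, with `udf(is_under_13, BooleanType())`. A filter keeps only the rows where the condition is true, so rows with a null result would be dropped.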
If you're interested, you can find the project on this GitHub repository. Hopefully it gives you a good idea of how to use Apache Spark for your own data wrangling projects. Good luck!