Hey there! So we decided to create a Word Count application - a classic MapReduce example. But what the heck is a Word Count application, you ask? It's basically a program that reads data and calculates the most common words. Easy peasy. For example:
dataDF = sqlContext.createDataFrame([('Jax',), ('Rammus',), ('Zac',), ('Xin', ), ('Hecarim', ), ('Zac', ), ('Rammus', )], ['Jungler'])
For this project, we used Apache Spark 2.0.2 on Databricks Cloud and used A Tale of Two Cities by Charles Dickens from Project Gutenberg as our dataset.
Once we read the dataset, we made the following assumptions to clean the data:
- Words should be counted regardless of capitalization.
- Punctuation should be removed.
- Leading and trailing spaces should be removed.
To make sure we stuck to these assumptions, we created a function called removePunctuation to perform the operations:
from pyspark.sql.functions import regexp_replace, trim, col, lower def removePunctuation(column): """Removes punctuation, changes to lower case, strips leading and trailing spaces. Args: column: a sentence Returns: Column: a column named 'sentence' """ return lower(trim(regexp_replace(column, '[^\w\s]+', ""))).alias('sentence')
Once that was done, we defined a function for word counting:
def wordCount(wordsDF): """ Creates a DataFrame with word counts. Args: wordsDF: A DataFrame consisting of one string column called 'Jungler'. Returns: DataFrame of a (str, int): containing Jungler and count columns. """ return (wordsDF .groupBy('word') .count())
Before we called the wordCount() function, we addressed two issues:
- We needed to split each line by spaces.
- We needed to filter out empty lines or words.
So we did that like this:
from pyspark.sql.functions import split,explode twoCitiesWordsDF = (twoCitiesDF .select(explode(split(twoCitiesDF.sentence, " ")).alias('word')) .where("word != ''") ) twoCitiesWordsDF.show(truncate=False) twoCitiesWordsDFCount = twoCitiesWordsDF.count() print twoCitiesWordsDFCount
And voila! Here are the top 20 most common words:
If you're interested, you can find the project on this github repository. Happy word counting!