Word Count is the classic MapReduce example, so we decided to build a Word Count application. But what exactly is a word count application? It is a program that reads text and counts how often each word appears, so that the most common words can be identified. For example, consider a DataFrame with a single column of champion names:
```python
dataDF = sqlContext.createDataFrame(
    [('Jax',), ('Rammus',), ('Zac',), ('Xin',), ('Hecarim',), ('Zac',), ('Rammus',)],
    ['Jungler'])
```
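To make the idea concrete, here is a minimal plain-Python sketch (not Spark) of what counting the 'Jungler' column above would produce, using the same names:

```python
from collections import Counter

# Names taken from the example DataFrame above.
junglers = ['Jax', 'Rammus', 'Zac', 'Xin', 'Hecarim', 'Zac', 'Rammus']

counts = Counter(junglers)
print(counts['Zac'])     # 2
print(counts['Jax'])     # 1
```

Spark's `groupBy` and `count` give the same per-value tallies, but computed in a distributed fashion.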
We developed this project using Apache Spark 2.0.2 on Databricks Cloud and used A Tale of Two Cities by Charles Dickens from Project Gutenberg as our dataset.
Once the dataset is read, we make the following three assumptions in order to clean it:
- Words should be counted independent of their capitalization.
- Punctuation should be removed.
- Leading and trailing spaces should be removed.
To apply these assumptions, we create a function called removePunctuation that performs all three operations.
```python
from pyspark.sql.functions import regexp_replace, trim, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and
    trailing spaces.

    Args:
        column (Column): a column containing a sentence.

    Returns:
        Column: a column named 'sentence' with the cleaned text.
    """
    return lower(trim(regexp_replace(column, r'[^\w\s]+', ''))).alias('sentence')
```
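The same cleaning rules can be sketched in plain Python with the `re` module; `clean_sentence` below is a hypothetical helper for illustration, not part of the project, but it applies the identical regex, trim, and lower-case steps in the same order:

```python
import re

def clean_sentence(sentence):
    # Drop punctuation, strip outer whitespace, then lower-case --
    # mirroring regexp_replace, trim, and lower in the Spark version.
    return re.sub(r'[^\w\s]+', '', sentence).strip().lower()

print(clean_sentence('  It was the best of times, it was the worst of times!  '))
# it was the best of times it was the worst of times
```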
Once that is done, we define a function for word counting.
```python
def wordCount(wordsDF):
    """Creates a DataFrame with word counts.

    Args:
        wordsDF (DataFrame): a DataFrame with one string column called 'word'.

    Returns:
        DataFrame: a DataFrame with 'word' and 'count' columns.
    """
    return (wordsDF
            .groupBy('word')
            .count())
```
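The `groupBy('word').count()` aggregation amounts to building a mapping from each word to its number of occurrences; a plain-Python sketch (not Spark) of that idea:

```python
# Tally each word into a dict, as groupBy + count does per group.
words = ['zac', 'rammus', 'zac']
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(counts)  # {'zac': 2, 'rammus': 1}
```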
Before we call wordCount(), we need to address two issues:
- Each line must be split into individual words on spaces.
- Empty lines and empty words must be filtered out.
```python
from pyspark.sql.functions import split, explode

twoCitiesWordsDF = (twoCitiesDF
                    .select(explode(split(twoCitiesDF.sentence, ' ')).alias('word'))
                    .where("word != ''"))
twoCitiesWordsDF.show(truncate=False)

twoCitiesWordsDFCount = twoCitiesWordsDF.count()
print(twoCitiesWordsDFCount)
```
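The split-and-filter step can be sketched in plain Python (not Spark): each sentence is split on spaces and empty strings are dropped, mirroring `explode(split(...))` followed by the `where` clause.

```python
sentences = ['it was the best of times', '', 'it was the worst of times']

# Splitting an empty sentence on ' ' yields [''], so the filter also
# removes the words produced by empty lines.
words = [w for s in sentences for w in s.split(' ') if w != '']
print(len(words))  # 12
```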
The following are the top 20 most common words:
The project is hosted in a GitHub repository and can be found here.