Word Count is a classic MapReduce example, so we decided to build a Word Count application.
But what exactly is a word count application? It is a program that reads data and counts how often each word occurs, typically reporting the most common ones. For example, we can start with a small DataFrame of words:

dataDF = sqlContext.createDataFrame([('Jax',), ('Rammus',), ('Zac',), ('Xin',),
                                     ('Hecarim',), ('Zac',), ('Rammus',)], ['Jungler'])
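At its core, word counting is just tallying occurrences. Before diving into Spark, here is a minimal pure-Python sketch of the idea using collections.Counter (illustrative only, not part of the Spark pipeline), applied to the same words as the DataFrame above:

```python
from collections import Counter

# Sample words, mirroring the 'Jungler' column above
words = ['Jax', 'Rammus', 'Zac', 'Xin', 'Hecarim', 'Zac', 'Rammus']

# Counter tallies how many times each word appears
counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
print(counts.most_common(2))
```

Here 'Rammus' and 'Zac' each appear twice, so they come out on top. Spark does the same aggregation, but distributed across a cluster.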

Word Count Example

We developed this project using Apache Spark 2.0.2 on Databricks Cloud and used A Tale of Two Cities by Charles Dickens from Project Gutenberg as our dataset.

Once we read the dataset, we make the following three assumptions in order to clean it:

  • Words should be counted independent of their capitalization.
  • Punctuation should be removed.
  • Leading and trailing spaces should be removed.

To enforce these assumptions, we create a function called removePunctuation that performs all three operations.

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
  """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

  Args:
      column (Column): a Column containing a sentence.

  Returns:
      Column: a Column named 'sentence' with the clean-up operations applied.
  """
  return lower(trim(regexp_replace(column, r'[^\w\s]+', ''))).alias('sentence')
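To see what these three operations do, here is a plain-Python sketch of the same logic using the standard re module with the same regular expression; this is just a sanity check of the transformation, not part of the Spark pipeline:

```python
import re

def clean(sentence):
  # Same three steps as removePunctuation:
  # drop punctuation, trim whitespace, lower-case
  return re.sub(r'[^\w\s]+', '', sentence).strip().lower()

print(clean("  It was the best of times, it was the worst of times!  "))
# it was the best of times it was the worst of times
```

The comma and exclamation mark are removed, the surrounding spaces are trimmed, and everything is lower-cased, exactly the behavior we want from the Spark column expression.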

Once that is done, we define a function for word counting.

def wordCount(wordsDF):
  """Creates a DataFrame with word counts.

  Args:
      wordsDF (DataFrame): a DataFrame consisting of one string column called 'Jungler'.

  Returns:
      DataFrame: a DataFrame containing 'Jungler' and 'count' columns.
  """
  return (wordsDF
          .groupBy('Jungler')
          .count())

Before we call wordCount(), we address the following two issues:

  • Each line needs to be split into individual words.
  • Empty lines and empty words need to be filtered out.

from pyspark.sql.functions import split, explode

twoCitiesWordsDF = (twoCitiesDF
                    .select(explode(split(twoCitiesDF.sentence, " ")).alias('word'))
                    .where("word != ''"))
twoCitiesWordsDFCount = twoCitiesWordsDF.count()
print(twoCitiesWordsDFCount)
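The split-and-filter step can be mimicked in plain Python. Splitting on single spaces produces empty strings for blank lines (and runs of spaces), which is exactly why the filter is needed; this is an illustrative sketch with toy lines, not the Spark code:

```python
lines = ['it was the best of times', 'it was the worst of times', '']

# Split each line on spaces and flatten, like split() + explode()
words = [w for line in lines for w in line.split(' ')]

# Drop empty strings, like the where("word != ''") clause
words = [w for w in words if w != '']

print(len(words))  # 12
```

Without the filter, the blank third line would contribute a spurious empty "word" to the count.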

The following are the top 20 most common words:

Top 20 Words
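Extracting a top-20 ranking is a sort-and-slice over the counts: in the DataFrame world it is an orderBy on the count column in descending order, and the equivalent plain-Python sketch (with made-up toy counts, not the real book's numbers) looks like:

```python
# Toy word counts; the real values come from wordCount() on the book
counts = {'the': 8, 'and': 5, 'of': 4, 'it': 3}

# Sort by count, descending, and keep the top N (20 in the post)
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)  # [('the', 8), ('and', 5)]
```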

The project is hosted in a GitHub repository and can be found here.