Unlocking the Power of MapReduce: Using Python and Apache Spark for Enhanced Data Processing

Hey there! So we decided to create a Word Count application - a classic MapReduce example. But what the heck is a Word Count application, you ask? It's basically a program that reads in some text and counts how many times each word appears. Easy peasy. For example, take this toy DataFrame of League of Legends junglers:

dataDF = sqlContext.createDataFrame([('Jax',), ('Rammus',), ('Zac',), ('Xin',),
                                     ('Hecarim',), ('Zac',), ('Rammus',)], ['Jungler'])
[Image: Word Count example output]
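
To make that concrete, here's a quick sketch of the counting step on that toy DataFrame (the screenshot isn't reproduced here, but grouping and counting is the idea it illustrated):

dataDF.groupBy('Jungler').count().show()
# Zac and Rammus each show up twice; everyone else once.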

For this project, we used Apache Spark 2.0.2 on Databricks Cloud, with A Tale of Two Cities by Charles Dickens (from Project Gutenberg) as our dataset.

Once we read the dataset, we made the following assumptions to clean the data:

  • Words should be counted regardless of capitalization.
  • Punctuation should be removed.
  • Leading and trailing spaces should be removed.

To enforce these assumptions, we created a function called removePunctuation:

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
  """Removes punctuation, changes to lower case, strips leading and trailing spaces.

  Args:
    column (Column): a Column containing a sentence.

  Returns:
    Column: a Column named 'sentence' with the clean-up applied.
  """
  return lower(trim(regexp_replace(column, r'[^\w\s]+', ''))).alias('sentence')
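
The later snippets refer to a twoCitiesDF with a 'sentence' column, which we never show being built above. Here's a minimal sketch of that step, assuming the Gutenberg text was uploaded to a hypothetical path:

from pyspark.sql.functions import col

fileName = '/datasets/tale_of_two_cities.txt'  # hypothetical path
# read.text yields one row per line of the file, in a single string column
# named 'value'; removePunctuation cleans it and renames it to 'sentence'
twoCitiesDF = (sqlContext.read.text(fileName)
               .select(removePunctuation(col('value'))))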

Once that was done, we defined a function for word counting:

def wordCount(wordsDF):
  """ Creates a DataFrame with word counts.

  Args:
      wordsDF: A DataFrame consisting of one string column called 'Jungler'.

  Returns:
      DataFrame of a (str, int): containing Jungler and count columns.
  """
  return (wordsDF
           .groupBy('word')
           .count())
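
As a quick sanity check (not part of the original walkthrough), you can point wordCount() at the toy jungler DataFrame from earlier, renaming its column to 'word' first since that's what the function groups by:

wordCount(dataDF.withColumnRenamed('Jungler', 'word')).show()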

Before we called the wordCount() function, we addressed two issues:

  • We needed to split each line by spaces.
  • We needed to filter out empty lines or words.

So we did that like this:

from pyspark.sql.functions import split, explode

twoCitiesWordsDF = (twoCitiesDF
                    .select(explode(split(twoCitiesDF.sentence, " ")).alias('word'))
                    .where("word != ''")
                    )

twoCitiesWordsDF.show(truncate=False)

twoCitiesWordsDFCount = twoCitiesWordsDF.count()
print(twoCitiesWordsDFCount)
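
All that's left is the final step that produces the top-20 list. A minimal sketch, using the wordCount() helper from above and ordering by descending count:

from pyspark.sql.functions import desc

topWordsDF = wordCount(twoCitiesWordsDF).orderBy(desc('count'))
topWordsDF.show(20, truncate=False)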

And voila! Here are the top 20 most common words:

[Image: top 20 most common words]

If you're interested, you can find the full project in this GitHub repository. Happy word counting!