Data Wrangling Made Easy: Leveraging Apache Spark to Transform Raw Data into Valuable Insights

Hey, so I’m a big fan of Apache Spark and I’ve been using it for all of my independent projects. I recently had the idea to create a project that showcases how to do some data wrangling with Apache Spark. For this project, I used Apache Spark 2.0.2 on Databricks cloud. Instead of using […]

The Art of Election Forecasting: Analyzing the 2012 US Presidential Election with Data Science

Hey there! Let’s talk about this dataset from RealClearPolitics and the US Presidential Election. Before we dive in, let’s get on the same page about a few things: The US Presidential Election happens every four years. There are 50 states in the US and each gets a certain number of electoral votes based on its […]

Making Sense of Big Data: A Beginner’s Guide to Logistic Regression Training in SparkR

Let’s do some Machine Learning with SparkR 1.6! The package only gives us the option to do linear or logistic regression, so for this exercise, we’re going to train a […]

Empowering R Programmers: Exploring the Capabilities of SparkR with RStudio

Let’s talk about SparkR! It’s an R package that provides a lightweight frontend to use Apache Spark from R. I used RStudio and Spark 1.6.1 for this exercise. SparkR has a distributed data frame implementation that supports operations like selection, filtering, and more. Cool, right? In RStudio, run the following code to check the system […]

Eliminating the Spam Menace: Building an Effective Machine Learning-Based Spam Filter

Hey there! Let’s talk about spam. You know, those annoying emails that keep showing up in your inbox, even though you never signed up for them. Yeah, those. Well, a spam filter is a program that filters out those unwanted emails and messages. Pretty cool, right? So, we’re going to build and evaluate a […]