We took the dataset from RealClearPolitics. Before we dive into the analysis, let’s get some understanding of how the United States presidential election works. The key points are as follows:

  • A president is elected every four years.
  • The United States has 50 states, and each is assigned a number of electoral votes based on its population.
  • The candidate with the most electoral votes wins the election.
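
To make the last point concrete, here is a minimal Python sketch of a winner-take-all tally. It is a simplification for illustration only: the state names, vote counts, and the `electoral_winner` helper are all hypothetical and are not part of our dataset or script.

```python
from collections import defaultdict

def electoral_winner(state_results, electoral_votes):
    """Return (winner, totals) under a simple winner-take-all rule."""
    totals = defaultdict(int)
    for state, party in state_results.items():
        # The party that carries a state receives all of its electoral votes.
        totals[party] += electoral_votes[state]
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Hypothetical states and electoral vote counts, for illustration only.
electoral_votes = {"State A": 10, "State B": 6, "State C": 3}
state_results = {
    "State A": "Democrat",
    "State B": "Republican",
    "State C": "Republican",
}

winner, totals = electoral_winner(state_results, electoral_votes)
print(winner, totals)  # Democrat {'Democrat': 10, 'Republican': 9}
```

Note that in this toy example the Democrat carries fewer states but still wins, because what counts is the total of electoral votes, not the number of states.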

Now, let’s talk about the dataset and how we handled missing data.

  • Dependent Variable: 1 if the Republican won a state and 0 if the Democrat won.
  • Independent Variables: Rasmussen, SurveyUSA, DiffCount and PropR.
  • More than 50% of the data is missing, and we wanted to retain the data rather than discard it, so we used the Multiple Imputation by Chained Equations (mice) package for imputation.
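
The idea behind chained equations can be sketched in a few lines of Python. This is not the mice package and not our script: it is a deterministic, single-imputation simplification (mice itself draws imputations at random and produces several completed datasets). The `chained_impute` and `fit_line` helpers and the values in `rows` are hypothetical, chosen only to show the cycle of "fill with a starting guess, then repeatedly re-predict each column from another".

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def chained_impute(rows, columns, n_iter=10):
    """Fill None values by regressing each column on another column in turn."""
    missing = {c: [i for i, r in enumerate(rows) if r[c] is None] for c in columns}
    filled = [dict(r) for r in rows]
    # Step 1: start from the mean of the observed values in each column.
    for c in columns:
        col_mean = mean(r[c] for r in rows if r[c] is not None)
        for i in missing[c]:
            filled[i][c] = col_mean
    # Step 2: cycle through the columns, re-predicting only the missing entries.
    for _ in range(n_iter):
        for k, target in enumerate(columns):
            predictor = columns[(k + 1) % len(columns)]
            observed = [i for i in range(len(rows)) if i not in missing[target]]
            a, b = fit_line([filled[i][predictor] for i in observed],
                            [filled[i][target] for i in observed])
            for i in missing[target]:
                filled[i][target] = a + b * filled[i][predictor]
    return filled

# Hypothetical values, for illustration only (None marks a missing entry).
rows = [
    {"Rasmussen": 5.0, "SurveyUSA": 6.0},
    {"Rasmussen": -3.0, "SurveyUSA": -2.0},
    {"Rasmussen": None, "SurveyUSA": 10.0},
    {"Rasmussen": 8.0, "SurveyUSA": None},
    {"Rasmussen": -7.0, "SurveyUSA": -6.0},
]
completed = chained_impute(rows, ["Rasmussen", "SurveyUSA"])
print(completed)
```

Observed values are never overwritten; only the entries that were missing get updated on each pass.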

Once the data was imputed, we split it into training and test sets: data from 2004 and 2008 for training, and data from 2012 for testing. We fit a logistic regression model using SurveyUSA and DiffCount as the independent variables. The full script is here.
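
As a rough illustration of the split-then-fit step, here is a small self-contained Python sketch. It is not our script: it fits the logistic regression with plain batch gradient descent instead of a statistics package, the rows are made up, and the `Year` and `Republican` column names, along with the `fit_logistic` and `predict_proba` helpers, are assumptions introduced for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, xi):
    """Probability of a Republican win; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical rows, for illustration only.
rows = [
    {"Year": 2004, "SurveyUSA": 9, "DiffCount": 3, "Republican": 1},
    {"Year": 2004, "SurveyUSA": -6, "DiffCount": -2, "Republican": 0},
    {"Year": 2008, "SurveyUSA": 4, "DiffCount": 1, "Republican": 1},
    {"Year": 2008, "SurveyUSA": -8, "DiffCount": -4, "Republican": 0},
    {"Year": 2012, "SurveyUSA": 7, "DiffCount": 2, "Republican": 1},
    {"Year": 2012, "SurveyUSA": -5, "DiffCount": -3, "Republican": 0},
]
features = ["SurveyUSA", "DiffCount"]

# Split by election year: 2004 and 2008 for training, 2012 for testing.
train = [r for r in rows if r["Year"] in (2004, 2008)]
test = [r for r in rows if r["Year"] == 2012]

w = fit_logistic([[r[f] for f in features] for r in train],
                 [r["Republican"] for r in train])

# Turn probabilities into binary predictions with a 0.5 threshold.
preds = [1 if predict_proba(w, [r[f] for f in features]) >= 0.5 else 0 for r in test]
# States where the binary prediction disagrees with the actual outcome.
mistakes = [r for r, p in zip(test, preds) if p != r["Republican"]]
print(preds, mistakes)
```

Comparing `preds` with the actual outcomes on the test year is how a misprediction, such as the one discussed next, would show up.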

In our analysis, we predicted that the Republicans would win Florida, but the Democrats won, so we analyzed our mistake.

Prediction Mistake Analyzed:

United States Map:

States according to our binary predictions. Light blue represents Republican:

Predictions with discrete outcomes:
