Lets do some ML with SparkR 1.6. As the package only gives us the option to do either linear or logistic regression, so for this exercise we are going to train a logistic regression model. I have refactored one of my previous project which was written in R to make it work with SparkR. You can find the script here . There seems a bug when we try to train a regression model with more than 8 predictors. So I’ve had to do some feature engineering in selecting the 8 important predictors. The data set has been taken from UCI ML Repository. Ok, lets do some fun work.

In RStudio, run the following code to check the system enviroment variables for spark home.

Sys.getenv()

System Environment Variables Output

For some reason if you don’t see SPARK_HOME set or the path is incorrect, you can change it via the environment variables. Once everything is working fine, we run the following code in order to load SparkR library and pass the needed drivers.


library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc <- sparkR.init(master = "local[*]", 
sparkEnvir = list(spark.driver.memory="2g", 
sparkPackages="com.databricks:spark-csv_2.11:1.0.3"))
sqlContext <- sparkRSQL.init(sc)

Lets read the daraframe in r and output its contents.


df_r <- read.csv("census.csv")
str(df_r)

R Data fram

Now, lets convert the R data frame to a SparkR DataFrame and print the schema.


census <- createDataFrame(sqlContext, df_r)
nrow(census)
showDF(census)
printSchema(census)

Schema Details

Next, lets randomly split the data so that we get a better regression model.


trainingData <- sample(census, FALSE, 0.6)
testData <- except(census, trainingData)

Once we have the training and test data, lets train our model using Logistic Regression and do predictions.


regModel <- glm(over50k ~ age + workclass + education + maritalstatus + occupation + 
    race + sex + hoursperweek, data = trainingData, family = "binomial")
summary(regModel)

Logistic Regression Summary


predictionsRegModel <- predict(regModel, newData = testData)
showDF(select(predictionsRegModel, "label", "prediction"))

Prediction

Finally, lets calculate error for each label.


errorsLogR <- select(predictionsRegModel, predictionsRegModel$label, 
    predictionsRegModel$prediction, alias(abs(predictionsRegModel$label - 
        predictionsRegModel$prediction), "error"))
showDF(errorsLogR)

Error Summary

The full script can be found here

After working with SparkR API, I have come to understand that SparkR 1.6 is still in an early phase and have some bugs and limitations. It would be better to use SparkML package as it is more stable, provides more ML algorithms and uses DataFrames for constructing ML pipelines.