SparkR (R on Spark) is an R package that provides a light-weight frontend to use Apache Spark from R. For this exercise I have used RStudio and Spark 1.6.1.
SparkR provides a distributed data frame implementation that supports operations such as selection, filtering and aggregation.

In RStudio, run the following code to list the system environment variables and check that SPARK_HOME is set.

Sys.getenv()
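Scanning the full output of Sys.getenv() for one entry is tedious, so a narrower check may help. The sketch below is only an illustration: the helper name has_sparkr is hypothetical, and it assumes the SparkR package sits under R/lib/SparkR inside the Spark directory, which is what the lib.loc path used later in this post implies.

```r
# Hypothetical helper: TRUE when `path` looks like a Spark installation that
# ships the SparkR package, i.e. it contains the directory R/lib/SparkR.
has_sparkr <- function(path = Sys.getenv("SPARK_HOME")) {
  nzchar(path) && dir.exists(file.path(path, "R", "lib", "SparkR"))
}

if (!has_sparkr()) {
  # To fix it for the current R session only, uncomment the next line and
  # replace the placeholder with your real install location:
  # Sys.setenv(SPARK_HOME = "/path/to/spark")
  message("SPARK_HOME is unset or does not contain R/lib/SparkR")
}
```

Sys.setenv only affects the running R session, so a permanent fix still belongs in the system environment variables.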

If SPARK_HOME is missing or points to the wrong path, change it in your system's environment variables. Once it is set correctly, run the following code to load the SparkR library and start Spark with the driver settings and packages we need.


library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# sparkPackages is an argument of sparkR.init itself, not an entry of sparkEnvir
sc <- sparkR.init(master = "local[*]",
                  sparkEnvir = list(spark.driver.memory = "2g"),
                  sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)

sc is the SparkContext, which is the entry point into SparkR. Here we created it with sparkR.init. sqlContext, created from sc with sparkRSQL.init, is what we use to work with DataFrames.

Now we are ready to use the SparkR API. Let's read the CSV file into a local R data frame and convert it into a Spark DataFrame.


df_o <- read.csv("Iris.csv")
df <- createDataFrame(sqlContext, df_o)
head(df)

Let's do some simple operations using SparkR. You can find the API here

# select and filter operations.
head(select(df, df$SepalLengthCm, df$Species))
head(filter(df, df$SepalLengthCm > 5.0))

# Compute average PetalLengthCm and group by Species.
head(agg(groupBy(df, "Species"), PetalLengthCm="avg"))

# Returns the schema of this DataFrame as a structType object.
dfSchema <- schema(df)
dfSchema

# Sort the DataFrame by the specified column.
head(arrange(df, df$SepalLengthCm))
# Sort in decreasing order
head(arrange(df, "SepalLengthCm", decreasing = TRUE))
# Print the first numRows rows of a DataFrame (20 by default)
showDF(df)

# Running SQL Queries from SparkR
registerTempTable(df, "iris")
irisSepalLGreater5 <- sql(sqlContext, 
    "SELECT Id, Species FROM iris WHERE SepalLengthCm > 5.0")
head(irisSepalLGreater5)
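head() only brings the first few rows back into R. To work with more of a result locally, the rows have to be collected first. The sketch below shows one way this might be done with SparkR's limit and collect. The helper name collect_sample and the 1000-row default are assumptions made for illustration, and the cap is there because collect copies everything it is given into the driver's memory.

```r
# Hypothetical helper: copy at most `n` rows of a Spark DataFrame into an
# ordinary R data.frame. The calls are namespaced, so SparkR only has to be
# loaded (with a running context) when the function is actually called.
collect_sample <- function(sdf, n = 1000L) {
  SparkR::collect(SparkR::limit(sdf, n))
}

# In the session from this post (assumed to be running):
# local_result <- collect_sample(irisSepalLGreater5)
# table(local_result$Species)   # plain base R from here on
```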

Don’t get confused: using SparkR does not stop us from doing everything else we normally do in R. For example, I’ve used ggplotly to create some infographics. Full script here

PetalLength vs PetalWidth

PetalLength vs PetalWidth within each Species
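
The full script is not reproduced in this post, so the following is only a rough sketch of how plots like these might be built, not the original code. It assumes the Spark DataFrame is first copied into a local data.frame with collect, that the CSV has a PetalWidthCm column (a name inferred from the other column names), and that ggplot2 and plotly are installed. The helper name petal_plot is hypothetical.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of petal length against petal width,
# coloured by species. `local_df` must be an ordinary R data.frame.
petal_plot <- function(local_df, by_species = FALSE) {
  p <- ggplot(local_df,
              aes(x = PetalLengthCm, y = PetalWidthCm, colour = Species)) +
    geom_point() +
    labs(x = "PetalLength", y = "PetalWidth")
  # One panel per species for the "within each Species" view
  if (by_species) p <- p + facet_wrap(~ Species)
  p
}

# In the session from this post (assumed to be running):
# local_df <- collect(df)
# plotly::ggplotly(petal_plot(local_df))
# plotly::ggplotly(petal_plot(local_df, by_species = TRUE))
```

The plotting itself happens entirely in local R. Spark is only involved in producing the data frame that is handed to the helper.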