Let's build and evaluate a spam filter. Before we dive in, let's ask ourselves: what does a spam filter do?
A spam filter is a program that filters out unwanted emails or messages. It is mostly used for email, but I think the same idea can be applied to other modes of communication by building a different version of the program for each.

The dataset is taken from the MIT website and has two data fields:

  • text: text of the email
  • spam: 1 if the email is spam, 0 otherwise.
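
The text field has to become numeric features before any model can use it. Here is a minimal sketch of that step with the tm package. The four-email `emails` data frame is made up for illustration, and the cleaning steps (lowercasing, removing punctuation and stop words) are an assumption, not necessarily the pipeline used for this post.

```r
library(tm)

# Made-up stand-in for the real dataset (same two fields: text, spam)
emails <- data.frame(
  text = c("Win a free prize now",
           "Meeting agenda for Monday",
           "Free money, claim your prize",
           "Please review the attached agenda"),
  spam = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Build a corpus and apply some common cleaning steps (assumed, not from the post)
corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per email, one column per term, cells hold term counts
dtm <- DocumentTermMatrix(corpus)

# Convert to a plain data frame with syntactically valid column names
emailsSparse <- as.data.frame(as.matrix(dtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))

# Attach the label as a factor so tree-based models treat it as a class
emailsSparse$spam <- as.factor(emails$spam)
```

On a real dataset the matrix has thousands of columns, so rare terms are usually dropped (for example with `removeSparseTerms`) before converting to a data frame.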

Once we have created a data frame from the DocumentTermMatrix, we fit three different models:


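library(rpart)
library(randomForest)
# Treat spam as a factor so randomForest fits a classifier (needed for type = 'prob' later)
train$spam <- as.factor(train$spam)
# Logistic regression model; with one predictor per term, it is likely to overfit
spamLog <- glm(spam ~ ., data = train, family = binomial())
# Classification tree model
spamCART <- rpart(spam ~ ., data = train, method = 'class')
# Random forest model; the seed makes the fit reproducible
set.seed(123)
spamRF <- randomForest(spam ~ ., data = train)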

We then evaluate these models using the test data:


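# Each vector holds the predicted probability that a test email is spam
predict_log2 <- predict(spamLog, type = 'response', newdata = test)
# For the tree and the forest, column 2 is the probability of class 1 (spam)
predict_cart2 <- predict(spamCART, newdata = test)[, 2]
predict_rf2 <- predict(spamRF, type = 'prob', newdata = test)[, 2]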
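
The three vectors above are predicted spam probabilities, so they can be scored directly against the true labels. Below is a minimal base R sketch of two common scores: accuracy at a 0.5 cutoff, and AUC. The helper names `accuracy_at` and `auc_rank` are made up for illustration, and the toy vectors stand in for `test$spam` and one of the `predict_*` vectors. The post does not say which metrics were used, so treat this as one reasonable option.

```r
# Share of emails classified correctly when prob > threshold means "spam"
accuracy_at <- function(actual, prob, threshold = 0.5) {
  mean((prob > threshold) == (actual == 1))
}

# AUC via the rank-sum (Mann-Whitney) formula: the probability that a
# randomly chosen spam email gets a higher score than a randomly chosen
# non-spam email. rank() assigns average ranks to ties.
auc_rank <- function(actual, prob) {
  r <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: stand-ins for test$spam and a vector of predicted probabilities
actual <- c(0, 0, 1, 1, 0, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90)

acc <- accuracy_at(actual, prob)  # 5 of 6 correct at the 0.5 cutoff
auc <- auc_rank(actual, prob)     # 8 of 9 spam/non-spam pairs ordered correctly
```

With the real objects, the calls would look like `accuracy_at(test$spam, predict_rf2)` and `auc_rank(test$spam, predict_rf2)`, repeated for each model.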

All three models perform well on the test data, but the random forest performs best. We could evaluate the models further by creating new test datasets and repeating the evaluation on each of them.
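
One way to read "new test datasets" is repeated random train/test splits. The sketch below tries that on simulated data with a logistic regression only. The data frame `toy`, its columns, the 70/30 split, and the five repetitions are all assumptions made for illustration, not part of the original script.

```r
set.seed(123)

# Simulated stand-in for the term data frame: two predictors and a 0/1 label
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$spam <- as.integer(toy$x1 + rnorm(n) > 0)

# Refit and rescore the model on five different random 70/30 splits
accs <- sapply(1:5, function(i) {
  idx <- sample(n, size = round(0.7 * n))           # rows used for training
  fit <- glm(spam ~ ., data = toy[idx, ], family = binomial())
  p <- predict(fit, newdata = toy[-idx, ], type = "response")
  mean((p > 0.5) == (toy$spam[-idx] == 1))          # accuracy on held-out rows
})

mean_acc <- mean(accs)  # average accuracy across the splits
sd_acc <- sd(accs)      # how much the score moves from split to split
```

If the spread across splits is small, the ranking of the models on a single test set is more trustworthy.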

The full script can be found here

Most frequent terms: [figure]

Focused view of the most frequent terms: [figure]

The 30 most frequent terms: [figure]