Let's build and evaluate a spam filter. Before we dive in, let's ask ourselves: what does a spam filter do?
A spam filter is a program that filters out unwanted emails or messages. It is mostly used for email, but I think the same idea can be applied to other modes of communication by building a version of the program for each one.
The dataset is taken from the MIT website and has two fields:
- text: text of the email
- spam: 1 if spam, 0 otherwise.
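To make the feature-building step concrete, here is a minimal sketch of how a text column like this could be turned into a data frame of term counts. It assumes the `tm` package; the toy `emails` data frame, the 0.95 sparsity threshold, and the name `emailsSparse` are illustrative assumptions, not taken from the original script.

```r
library(tm)

# Toy stand-in for the real dataset (hypothetical rows)
emails <- data.frame(
  text = c("Win FREE money now!", "Meeting agenda for Monday",
           "Free prize, claim now", "Project status meeting notes"),
  spam = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Build a corpus and apply common cleaning steps
corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per email, one column per term
dtm <- DocumentTermMatrix(corpus)

# Drop very rare terms (threshold is an assumption)
dtm <- removeSparseTerms(dtm, 0.95)

# Convert to a data frame with syntactically valid column names
emailsSparse <- as.data.frame(as.matrix(dtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))

# Attach the label as a factor so tree-based models treat it as a class
emailsSparse$spam <- as.factor(emails$spam)
```

A data frame like `emailsSparse` can then be split into `train` and `test` sets before fitting any models.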
Once we have created a data frame from the DocumentTermMatrix, we fit three different models:
library(rpart)
library(randomForest)

# logistic regression model; this is going to overfit
spamLog <- glm(spam ~ ., data = train, family = binomial())

# classification tree model
spamCART <- rpart(spam ~ ., data = train, method = 'class')

# random forest model
# (assumes train$spam is a factor, so that a classification forest is fit)
set.seed(123)
spamRF <- randomForest(spam ~ ., data = train)
We then evaluate these models using the test data:
predict_log2 <- predict(spamLog, type = 'response', newdata = test)
predict_cart2 <- predict(spamCART, newdata = test)[, 2]
predict_rf2 <- predict(spamRF, type = 'prob', newdata = test)[, 2]
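Predicted probabilities like these can be summarized as accuracy at a cutoff and as AUC. The sketch below uses base R only; the helper names `accuracy_at` and `auc_rank`, the 0.5 cutoff, and the toy vectors are assumptions for illustration, not part of the original script.

```r
# Fraction of cases classified correctly at a given probability cutoff
accuracy_at <- function(probs, actual, threshold = 0.5) {
  predicted <- as.integer(probs >= threshold)
  mean(predicted == actual)
}

# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(probs, actual) {
  r <- rank(probs)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: five predictions and their true labels
probs  <- c(0.9, 0.2, 0.7, 0.4, 0.1)
actual <- c(1, 0, 1, 1, 0)

acc <- accuracy_at(probs, actual)  # 0.8: the 0.4 case is missed at cutoff 0.5
auc <- auc_rank(probs, actual)     # 1: every spam case outranks every non-spam case
```

With the real models, the same helpers could be called with, for example, `predict_rf2` and the 0/1 labels from `test$spam`.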
All three models perform well on the test data, but the random forest performs best. We could evaluate the models further by creating new test datasets and iterating over them.
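One way to iterate over new test datasets is to repeat the random train/test split several times and look at the spread of the results. This is a sketch on simulated data with a logistic regression only; the `toy` data frame, the 70/30 split, and the 20 repetitions are all assumptions chosen for illustration.

```r
set.seed(123)

# Simulated stand-in for the term-count data (hypothetical features x1, x2)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$spam <- as.integer(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5) > 0)

# Fit on a random training subset and return accuracy on the held-out rows
holdout_accuracy <- function(data, train_frac = 0.7) {
  idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit <- glm(spam ~ ., data = data[idx, ], family = binomial())
  probs <- predict(fit, newdata = data[-idx, ], type = "response")
  mean(as.integer(probs >= 0.5) == data$spam[-idx])
}

# Repeat the split 20 times and summarize
accs <- replicate(20, holdout_accuracy(toy))
mean(accs)
sd(accs)
```

A small standard deviation across splits would suggest that the reported accuracy does not depend much on one particular test set.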
The full script can be found here
Most frequent terms:
A closer look at the most frequent terms:
The 30 most frequent terms:
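Term frequencies like these can be computed by summing the columns of the term-count data frame and sorting. A minimal base R sketch, where the `termCounts` data frame is a hypothetical toy example rather than the real data:

```r
# Toy term counts: one row per email, one column per term
termCounts <- data.frame(
  free    = c(2, 0, 1),
  meeting = c(0, 1, 0),
  money   = c(1, 0, 1)
)

# Total occurrences of each term, most frequent first
termFreq <- sort(colSums(termCounts), decreasing = TRUE)

# Keep the top 30 (here there are only 3 terms, so all are returned)
topTerms <- head(termFreq, 30)
topTerms
```

On the real data, the same two lines would be applied to the data frame built from the DocumentTermMatrix, after dropping the `spam` column.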