In this Spam Filter Project Using R, we'll build a spam filter that classifies text messages as spam or ham using the Naive Bayes algorithm.
This blog will explain the steps involved in this project, starting from loading and cleaning the dataset to training and evaluating the model.
This project will also teach concepts like text mining, natural language processing, and how to use R packages like tm, e1071, and caret.
These are the tools and libraries that will be used in this project. Alongside R itself, we'll rely on the tm, e1071, caret, stringr, and dplyr packages, and on Naive Bayes, a probabilistic classifier widely used for spam detection.
Step-by-Step Explanation For This Spam Filter Project Using R and Naive Bayes
The following section will break down the step-by-step process of creating this project and running it in Colab.
Step 1: Setting Up Google Colab for R and Uploading the Dataset
Before we start creating our spam filter project, we need to set up the right tools. We’ll use Google Colab, which is a cloud-based platform that usually runs Python, but we can also run R code by switching the runtime. Download the CSV dataset from platforms like Kaggle and follow these steps.
Here's how to get started:
1. Open Google Colab
2. Change Runtime to R:
Go to the top menu: Runtime > Change runtime type
In the "Runtime type" dropdown, select R
Click Save
Now you're ready to run R code inside Colab.
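To confirm the switch worked, you can run a quick one-line check in a code cell; if the runtime is R, it prints the version string instead of a Python error.
# If the runtime was switched correctly, this prints the installed R version
R.version.string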
Step 2: Installing Required R Libraries
After setting up Colab, we will install the R libraries required for this Spam Filter Project. These libraries are the tools that give us functions to clean text, build models, and evaluate results. You only need to install these once. The code for this step is:
# Install only once, skip this step if already installed in your session
install.packages("tm") # Used for text mining and preprocessing
install.packages("e1071") # Provides the Naive Bayes algorithm for classification
install.packages("caret") # Helps with splitting data and evaluating model performance
install.packages("stringr") # Useful for string operations like pattern matching
install.packages("dplyr") # Makes it easier to filter, select, and manipulate data
Step 3: Loading the Libraries and Reading the Dataset
In this step, we'll load the libraries we just installed and read the dataset we downloaded. Use the following code for this step.
# Load the required libraries into memory
library(tm) # Text mining
library(e1071) # Naive Bayes classification
library(caret) # For data splitting and model evaluation
library(stringr) # For string operations
library(dplyr) # For data manipulation
# Read the CSV file containing spam/ham messages
data <- read.csv("SPAM text message 20170820 - Data.csv", stringsAsFactors = FALSE)
# Display the first few rows of the dataset to understand its structure
head(data)
The output for this section will be:
  Category Message
1 ham      Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 ham      Ok lar... Joking wif u oni...
3 spam     Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham      U dun say so early hor... U c already then say...
5 ham      Nah I don't think he goes to usf, he lives around here though
6 spam     FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
Step 4: Exploring and Preparing the Dataset
In this step, we'll take a closer look at the structure of our dataset and prepare it for modeling. We will check the column names, rename them to something easier to work with, convert the labels to a categorical format (factor), and finally inspect how many spam vs. ham messages the dataset contains. We'll use the following code:
# Check the column names in the dataset
colnames(data)
# Rename the columns to "label" (spam/ham) and "text" (message content)
colnames(data) <- c("label", "text")
# Convert the label column to a factor so R knows it represents categories
data$label <- factor(data$label)
# Check how many spam and ham messages are in the dataset
table(data$label)
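If you prefer percentages over raw counts, prop.table converts the table to proportions in one line; the dataset is heavily tilted toward ham, which matters later when reading the confusion matrix.
# Optional: view the class balance as percentages instead of raw counts
round(prop.table(table(data$label)) * 100, 2)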
Step 5: Cleaning the Text and Creating a Document-Term Matrix
In this step, we will convert all text to lowercase. We will remove numbers and punctuation, eliminate common words (like "the", "and"), and clean up extra spaces. After cleaning, we will convert the text into a format that a machine learning model can understand: a Document-Term Matrix (DTM), where rows are messages and columns are words.
Use the code below for this step:
# Create a corpus (collection of all text messages)
corpus <- VCorpus(VectorSource(data$text))
# Clean the text in multiple steps using the tm_map function
corpus_clean <- corpus %>%
tm_map(content_transformer(tolower)) %>% # Convert text to lowercase
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removePunctuation) %>% # Remove punctuation marks
tm_map(removeWords, stopwords("english")) %>% # Remove common English stopwords (e.g., "the", "and")
tm_map(stripWhitespace) # Remove unnecessary white spaces
# Create a Document-Term Matrix: a table of word frequencies per message
dtm <- DocumentTermMatrix(corpus_clean)
# Show the dimensions of the matrix (number of messages × number of unique words)
dim(dtm)
The output for the above step would be:
5572 8305
This means that the Document-Term Matrix (DTM) has:
5572 rows, where each row represents one text message (SMS).
8305 columns, where each column represents a unique word (term) found in the dataset after cleaning.
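To get a feel for what the matrix actually contains, you can peek at a small corner of it (the exact terms shown depend on the cleaned vocabulary):
# Inspect the first 5 messages against the first 10 terms of the DTM
inspect(dtm[1:5, 1:10])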
Step 6: Splitting the Data and Preparing Features for Naive Bayes
Before training the model, we need to split the data into training and testing sets. We’ll use 80% of the messages to train the model and 20% to see how well it performs.
# Split the data into 80% training and 20% testing
set.seed(123) # Setting seed to make the split reproducible
train_index <- createDataPartition(data$label, p = 0.8, list = FALSE)
# Create training and testing Document-Term Matrices
dtm_train <- dtm[train_index, ]
dtm_test <- dtm[-train_index, ]
# Split the labels (spam or ham)
train_labels <- data$label[train_index]
test_labels <- data$label[-train_index]
# Remove sparse terms to reduce dimensions and keep only frequent words (appearing in at least 1% of messages)
dtm_train_freq <- removeSparseTerms(dtm_train, 0.99)
# Define a function to convert word counts into "Yes"/"No" (word present or not)
convert_counts <- function(x) {
y <- ifelse(x > 0, "Yes", "No") # If count > 0, mark "Yes"
y <- factor(y, levels = c("No", "Yes")) # Ensure consistent factor levels
return(y)
}
# Apply the conversion function to training data
train_data <- apply(dtm_train_freq, MARGIN = 2, convert_counts)
# Apply the same transformation to test data using only training vocab
test_data <- apply(dtm_test[, colnames(dtm_train_freq)], MARGIN = 2, convert_counts)
After this step, the text data is now fully prepared for training the Naive Bayes model. It has been cleaned, reduced, and encoded as simple binary features.
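A quick sanity check at this point confirms the shapes of the feature matrices and the label balance in the training split:
# Sanity check: feature matrix shapes and training label balance
dim(train_data)     # training messages x retained words
dim(test_data)      # test messages x the same words
table(train_labels) # ham/spam counts in the training split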
Step 7: Training the Naive Bayes Model and Evaluating Performance
Now that the data is ready, we can train the Naive Bayes classifier on the training set. After training, we'll use the model to predict whether messages in the test set are spam or ham. Finally, we'll check the model's accuracy and view a confusion matrix to understand how well it performed.
Use this code for this step:
# Train the Naive Bayes model using the training data
model <- naiveBayes(train_data, train_labels)
# Use the trained model to make predictions on the test data
predictions <- predict(model, test_data)
# Evaluate the model using a confusion matrix
confusion <- confusionMatrix(predictions, test_labels)
# Print the evaluation results (accuracy, precision, recall, etc.)
print(confusion)
The output for the above code will be:
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  955   37
      spam  10  112

               Accuracy : 0.9578
                 95% CI : (0.9443, 0.9688)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8028

 Mcnemar's Test P-Value : 0.0001491

            Sensitivity : 0.9896
            Specificity : 0.7517
         Pos Pred Value : 0.9627
         Neg Pred Value : 0.9180
             Prevalence : 0.8662
         Detection Rate : 0.8573
   Detection Prevalence : 0.8905
      Balanced Accuracy : 0.8707

       'Positive' Class : ham
Key Metrics Of The Above Output:

Metric              Value    What it Means
Accuracy            95.78%   The model correctly classified ~96% of all messages
Sensitivity         98.96%   The model was excellent at identifying ham messages
Specificity         75.17%   The model was moderately good at catching spam messages
Kappa               0.80     Shows strong agreement between predictions and actual labels
Balanced Accuracy   87.07%   Averages sensitivity and specificity to balance performance on both classes
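Before tweaking the model further, here's a minimal sketch of using it on a brand-new message. The message text is invented for illustration, and the sketch assumes the objects from the earlier steps are still in memory. The key detail is that the new text must go through the same cleaning steps and be restricted to the training vocabulary so the features line up with what the model saw.
# Hypothetical example message, invented for illustration
new_text <- "Congratulations! You have won a free prize. Reply WIN now!"
# Run the new text through the same cleaning pipeline as the training data
new_corpus <- VCorpus(VectorSource(new_text))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removeNumbers)
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("english"))
new_corpus <- tm_map(new_corpus, stripWhitespace)
# Restrict the new DTM to the training vocabulary so the columns line up
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = colnames(dtm_train_freq)))
# Apply the same Yes/No presence encoding used during training
new_features <- ifelse(as.matrix(new_dtm) > 0, "Yes", "No")
predict(model, new_features)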
Step 8: Improving the Model with Laplace Smoothing
In this step, we will try to improve our Naive Bayes model by adding Laplace smoothing. This technique helps avoid zero probabilities for words that don’t appear in the training data but do appear in the test data.
Use this code for this step:
# Train a Naive Bayes model with Laplace smoothing to avoid zero probabilities
model_laplace <- naiveBayes(train_data, train_labels, laplace = 1)
# Make predictions on the test data using the new model
pred_laplace <- predict(model_laplace, test_data)
# Evaluate the model performance with a confusion matrix
confusionMatrix(pred_laplace, test_labels)
The output for the above code is:
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  953   36
      spam  12  113

               Accuracy : 0.9569
                 95% CI : (0.9433, 0.9681)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8005

 Mcnemar's Test P-Value : 0.0009009

            Sensitivity : 0.9876
            Specificity : 0.7584
         Pos Pred Value : 0.9636
         Neg Pred Value : 0.9040
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8878
      Balanced Accuracy : 0.8730

       'Positive' Class : ham
Updated Performance Metrics:

Metric              Value    Change
Accuracy            95.69%   Slightly lower
Sensitivity (Ham)   98.76%   Slightly lower
Specificity (Spam)  75.84%   Slightly higher
Kappa               0.8005   Slightly lower
Balanced Accuracy   87.30%   Slightly improved
Adding Laplace smoothing made a small trade-off:
A very slight drop in ham accuracy
A small gain in identifying spam correctly (specificity)
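If you want to see whether a different smoothing strength helps, a small sweep like the sketch below compares a few laplace values on the same split (the values chosen here are arbitrary examples, and the sketch assumes the train/test objects from earlier steps are still in memory):
# Illustrative sweep over a few Laplace smoothing values
for (lp in c(0, 0.5, 1, 2)) {
  m   <- naiveBayes(train_data, train_labels, laplace = lp)
  acc <- mean(predict(m, test_data) == test_labels)
  cat("laplace =", lp, "-> accuracy =", round(acc, 4), "\n")
}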
Conclusion
In this Spam Filter project, we built a Naive Bayes classification model using R in Google Colab to detect whether a message is spam or not based on its text content.
We began by cleaning and preprocessing the text data, created a Document-Term Matrix, and then trained the model on 80% of the messages while testing on the remaining 20%. The model's performance was evaluated using a confusion matrix, showing an accuracy of 95.78%.
We further applied Laplace smoothing to improve robustness, which slightly boosted the model’s ability to handle unseen words while maintaining high accuracy and balanced performance.
1. How does a spam text classification system work using R?
A spam classification system in R uses machine learning to detect unsolicited or harmful messages. It analyzes the content of text messages and classifies them as either "spam" or "ham" using algorithms like Naive Bayes. The process involves data cleaning, text preprocessing, feature extraction, and training a model to accurately identify spam messages.
2. Which tools and R libraries are used in this Spam Filter project?
The main tools and libraries used are:
Google Colab (via R kernel or Jupyter)
R libraries: tm for text mining, e1071 for Naive Bayes, caret for evaluation, stringr and dplyr for data cleaning and manipulation.
3. Why is Naive Bayes suitable for spam detection?
Naive Bayes works well with text data and assumes independence between features (words), making it fast and effective for spam filtering tasks. It also performs well even with limited training data and handles high-dimensional data efficiently.
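In rough terms, the classifier scores each class with Bayes' rule under that independence assumption and picks the class with the higher score:

P(spam | w1, ..., wn) ∝ P(spam) × P(w1 | spam) × ... × P(wn | spam)

where w1, ..., wn are the words in the message; the same product is computed for ham.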
4. Can I use other algorithms to improve the spam filter?
Yes. Besides Naive Bayes, you can experiment with the following (a quick sketch of one alternative appears after this list):
Support Vector Machines (SVM)
Random Forest
Logistic Regression
Gradient Boosting (e.g., XGBoost)
Neural Networks (for more complex filters)
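As one example, here's a minimal sketch of trying a linear SVM with the e1071 package that's already installed for this project. It assumes the train/test objects from the main walkthrough are in memory, and the variable names introduced here are for illustration only.
# Minimal sketch: linear SVM as an alternative to Naive Bayes.
# SVMs need numeric features, so use 0/1 word presence instead of "Yes"/"No".
train_num <- (as.matrix(dtm_train_freq) > 0) * 1
test_num  <- (as.matrix(dtm_test[, colnames(dtm_train_freq)]) > 0) * 1
svm_model <- svm(x = train_num, y = train_labels, kernel = "linear")
svm_pred  <- predict(svm_model, test_num)
mean(svm_pred == test_labels)  # rough accuracy check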
5. What are some similar beginner-friendly machine learning projects in R?
If you enjoyed this project, a few classic beginner-friendly ML projects in R to explore next include iris flower classification, Titanic survival prediction, movie review sentiment analysis, and house price prediction.