Spam Filter Project Using R with Naive Bayes – With Code
By Rohit Sharma
Updated on Jul 25, 2025 | 10 min read | 1.33K+ views
In this Spam Filter Project Using R, we'll build a spam filter that classifies text messages as spam or not spam (ham) using the Naive Bayes algorithm.
This blog will explain the steps involved in this project, starting from loading and cleaning the dataset to training and evaluating the model.
This project will also teach concepts like text mining, natural language processing, and how to use R packages like tm, e1071, and caret.
Looking for More R Projects? Here Are the Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
These are the tools and libraries that will be used in this project. From R to Naive Bayes, each one helps ensure that the Spam Filter Project runs smoothly.
| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Programming Language | R | Main language used for data analysis and modeling |
| Text Mining Library | tm | For text preprocessing and cleaning |
| Machine Learning | e1071 | Implements the Naive Bayes classification algorithm |
| Model Evaluation | caret | Used for training/testing split and model evaluation |
| Data Manipulation | dplyr | For efficient data handling and transformations |
| String Handling | stringr | To work with and manipulate text data |
| Dataset Format | CSV | Comma-separated values file containing labeled messages |
| Algorithm | Naive Bayes | A probabilistic classifier used for spam detection |
Before starting this project, it's helpful to have a basic understanding of the following concepts:
1. R syntax and data frames
2. Working with text data
3. Classification (how a model assigns labels such as spam/ham)
Here’s a breakdown of the project duration, difficulty, and required skill level for this Spam Filter Project Using Naive Bayes:
| Aspect | Details |
| --- | --- |
| Project Duration | 2 to 4 hours (depending on familiarity with R) |
| Difficulty Level | Beginner to Intermediate |
| Skill Level Required | Basic understanding of R, text data, and classification |
Must Read: R For Data Science: Why Should You Choose R for Data Science?
The following section will break down the step-by-step process of creating this project and running it in Colab.
Before we start creating our spam filter project, we need to set up the right tools. We’ll use Google Colab, a cloud-based platform that runs Python by default but can also run R code after switching the runtime. Download the CSV dataset from a platform like Kaggle, then follow these steps.
Here's how to get started:
1. Open Google Colab
2. Change Runtime to R: go to Runtime > Change runtime type, select R as the language, and save.
Now you're ready to run R code inside Colab.
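To confirm the switch worked, you can run a quick one-liner; in an R runtime it prints the installed R version instead of raising a Python error:

# Prints something like "R version 4.x.x ..." if the runtime is R
R.version.string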
After setting up Colab, we will install the R libraries required for this Spam Filter Project. These libraries are the tools that give us functions to clean text, build models, and evaluate results. You only need to install these once. The code for this step is:
# Install only once, skip this step if already installed in your session
install.packages("tm") # Used for text mining and preprocessing
install.packages("e1071") # Provides the Naive Bayes algorithm for classification
install.packages("caret") # Helps with splitting data and evaluating model performance
install.packages("stringr") # Useful for string operations like pattern matching
install.packages("dplyr") # Makes it easier to filter, select, and manipulate data
Also Read: 10 Must-Try R Project Ideas for Beginners in 2025!
In this step, we’ll load the libraries we just installed and read in the dataset we downloaded. Use the following code for this step.
# Load the required libraries into memory
library(tm) # Text mining
library(e1071) # Naive Bayes classification
library(caret) # For data splitting and model evaluation
library(stringr) # For string operations
library(dplyr) # For data manipulation
# Read the CSV file containing spam/ham messages
data <- read.csv("SPAM text message 20170820 - Data.csv", stringsAsFactors = FALSE)
# Display the first few rows of the dataset to understand its structure
head(data)
The output for this section will be:
| | Category <chr> | Message <chr> |
| --- | --- | --- |
| 1 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 2 | ham | Ok lar... Joking wif u oni... |
| 3 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 4 | ham | U dun say so early hor... U c already then say... |
| 5 | ham | Nah I don't think he goes to usf, he lives around here though |
| 6 | spam | FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
In this step, we’ll take a closer look at the structure of our dataset and prepare it for modeling. We’ll check the column names, rename them to something easier to work with, convert the labels to a categorical format (factor), and inspect how many spam vs. ham messages the dataset contains. We’ll use the following code:
# Check the column names in the dataset
colnames(data)
# Rename the columns to "label" (spam/ham) and "text" (message content)
colnames(data) <- c("label", "text")
# Convert the label column to a factor so R knows it represents categories
data$label <- factor(data$label)
# Check how many spam and ham messages are in the dataset
table(data$label)
The output for the above step will be:
'Category' 'Message'

 ham spam 
4825  747
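Since ham messages dominate (4825 vs. 747, roughly 87% ham), it's worth also looking at the class proportions; on imbalanced data like this, accuracy alone can be misleading:

# Class proportions; a model that always predicts "ham" would already be
# ~87% accurate, so we'll also watch spam-specific metrics later
prop.table(table(data$label))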
Read To Learn More: R vs Python Data Science: The Difference
In this step, we will convert all text to lowercase. We will remove numbers and punctuation, eliminate common words (like "the", "and"), and clean up extra spaces. After cleaning, we will convert the text into a format that a machine learning model can understand: a Document-Term Matrix (DTM), where rows are messages and columns are words.
Use the code below for this step:
# Create a corpus (collection of all text messages)
corpus <- VCorpus(VectorSource(data$text))
# Clean the text in multiple steps using the tm_map function
corpus_clean <- corpus %>%
tm_map(content_transformer(tolower)) %>% # Convert text to lowercase
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removePunctuation) %>% # Remove punctuation marks
tm_map(removeWords, stopwords("english")) %>% # Remove common English stopwords (e.g., "the", "and")
tm_map(stripWhitespace) # Remove unnecessary white spaces
# Create a Document-Term Matrix: a table of word frequencies per message
dtm <- DocumentTermMatrix(corpus_clean)
# Show the dimensions of the matrix (number of messages × number of unique words)
dim(dtm)
The output for the above step would be:
5572 8305

This means that the Document-Term Matrix (DTM) has 5,572 rows (one per message) and 8,305 columns (one per unique word left after cleaning).
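To get a feel for the matrix before modeling, you can compare a raw message with its cleaned version and list the most common terms (findFreqTerms is part of the tm package):

# Compare an original message with its cleaned counterpart
as.character(corpus[[1]])
as.character(corpus_clean[[1]])
# List terms that appear at least 100 times across all messages
findFreqTerms(dtm, lowfreq = 100)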
Before training the model, we need to split the data into training and testing sets. We’ll use 80% of the messages to train the model and 20% to see how well it performs.
# Split the data into 80% training and 20% testing
set.seed(123) # Setting seed to make the split reproducible
train_index <- createDataPartition(data$label, p = 0.8, list = FALSE)
# Create training and testing Document-Term Matrices
dtm_train <- dtm[train_index, ]
dtm_test <- dtm[-train_index, ]
# Split the labels (spam or ham)
train_labels <- data$label[train_index]
test_labels <- data$label[-train_index]
# Remove sparse terms to reduce dimensions and keep only frequent words (appearing in at least 1% of messages)
dtm_train_freq <- removeSparseTerms(dtm_train, 0.99)
# Define a function to convert word counts into "Yes"/"No" (word present or not)
convert_counts <- function(x) {
y <- ifelse(x > 0, "Yes", "No") # If count > 0, mark "Yes"
y <- factor(y, levels = c("No", "Yes")) # Ensure consistent factor levels
return(y)
}
# Apply the conversion function to training data
train_data <- apply(dtm_train_freq, MARGIN = 2, convert_counts)
# Apply the same transformation to test data using only training vocab
test_data <- apply(dtm_test[, colnames(dtm_train_freq)], MARGIN = 2, convert_counts)
After this step, the text data is now fully prepared for training the Naive Bayes model. It has been cleaned, reduced, and encoded as simple binary features.
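As a quick sanity check, you can call the convert_counts helper on a small made-up vector to see the encoding it produces:

# Made-up counts: anything above zero becomes "Yes"
convert_counts(c(0, 2, 0, 5))
# Returns the factor: No Yes No Yes (levels: No, Yes)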
Read This to Learn More: Benefits of Learning R: Why It’s Essential for Data Science
Now that the data is ready, we can train the Naive Bayes classifier on the training set. After training, we'll use the model to predict whether messages in the test set are spam or ham. Finally, we’ll check the model’s accuracy and view a confusion matrix to understand how well it performed.
Use this code for this step:
# Train the Naive Bayes model using the training data
model <- naiveBayes(train_data, train_labels)
# Use the trained model to make predictions on the test data
predictions <- predict(model, test_data)
# Evaluate the model using a confusion matrix
confusion <- confusionMatrix(predictions, test_labels)
# Print the evaluation results (accuracy, precision, recall, etc.)
print(confusion)
The output for the above code will be:
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham   955   37
      spam   10  112

               Accuracy : 0.9578
                 95% CI : (0.9443, 0.9688)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8028

 Mcnemar's Test P-Value : 0.0001491

            Sensitivity : 0.9896
            Specificity : 0.7517
         Pos Pred Value : 0.9627
         Neg Pred Value : 0.9180
             Prevalence : 0.8662
         Detection Rate : 0.8573
   Detection Prevalence : 0.8905
      Balanced Accuracy : 0.8707

       'Positive' Class : ham
Key Metrics of the Above Output:

| Metric | Value | What it Means |
| --- | --- | --- |
| Accuracy | 95.78% | The model correctly classified ~96% of all messages |
| Sensitivity | 98.96% | The model was excellent at identifying ham messages |
| Specificity | 75.17% | The model was moderately good at catching spam messages |
| Kappa | 0.80 | Shows strong agreement between predictions and actual labels |
| Balanced Accuracy | 87.07% | Averages sensitivity and specificity to balance performance on both classes |
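If you want these numbers programmatically instead of reading the printed summary, the caret confusionMatrix object exposes them through its overall and byClass components:

# Pull individual metrics out of the confusionMatrix object
confusion$overall["Accuracy"]     # 0.9578
confusion$byClass["Sensitivity"]  # 0.9896 (ham detection rate)
confusion$byClass["Specificity"]  # 0.7517 (spam detection rate)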
Must Read: The Ultimate R Cheat Sheet for Data Science Enthusiasts
In this step, we will try to improve our Naive Bayes model by adding Laplace smoothing. This technique helps avoid zero probabilities for words that don’t appear in the training data but do appear in the test data.
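To see why this matters, here is a toy calculation with made-up counts. Without smoothing, a word that never appeared in spam during training forces the estimated spam probability of any message containing it to zero, no matter what the other words suggest:

# Toy illustration of Laplace smoothing (all counts are hypothetical)
count_in_spam <- 0     # word never seen in spam training messages
total_spam    <- 1000  # hypothetical total word occurrences in the spam class
vocab_size    <- 8305  # unique terms in our DTM
laplace       <- 1
# Without smoothing the estimate is exactly 0, zeroing out the whole product
count_in_spam / total_spam
# With add-one smoothing the estimate is small but non-zero
(count_in_spam + laplace) / (total_spam + laplace * vocab_size)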
Use this code to train and evaluate the smoothed model:
# Train a Naive Bayes model with Laplace smoothing to avoid zero probabilities
model_laplace <- naiveBayes(train_data, train_labels, laplace = 1)
# Make predictions on the test data using the new model
pred_laplace <- predict(model_laplace, test_data)
# Evaluate the model performance with a confusion matrix
confusionMatrix(pred_laplace, test_labels)
The output for the above code is:
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham   953   36
      spam   12  113

               Accuracy : 0.9569
                 95% CI : (0.9433, 0.9681)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8005

 Mcnemar's Test P-Value : 0.0009009

            Sensitivity : 0.9876
            Specificity : 0.7584
         Pos Pred Value : 0.9636
         Neg Pred Value : 0.9040
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8878
      Balanced Accuracy : 0.8730

       'Positive' Class : ham
Updated Performance Metrics:

| Metric | Value | Change |
| --- | --- | --- |
| Accuracy | 95.69% | Slightly lower |
| Sensitivity (Ham) | 98.76% | Slightly lower |
| Specificity (Spam) | 75.84% | Slightly higher |
| Kappa | 0.8005 | Slightly lower |
| Balanced Accuracy | 87.30% | Slightly improved |
Adding Laplace smoothing made a small trade-off: accuracy and ham sensitivity dipped very slightly, while spam specificity and balanced accuracy improved, so the model catches a bit more spam and is more robust to words it never saw during training.
In this Spam Filter project, we built a Naive Bayes classification model using R in Google Colab to detect whether a message is spam or not based on its text content.
We began by cleaning and preprocessing the text data, created a Document-Term Matrix, and then trained the model on 80% of the messages while testing on the remaining 20%. The model's performance was evaluated using a confusion matrix, showing an accuracy of 95.78%.
We further applied Laplace smoothing to improve robustness, which slightly boosted the model’s ability to handle unseen words while maintaining high accuracy and balanced performance.
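As a follow-up, here is a minimal sketch of how you could classify a brand-new message with the trained model. The message text below is made up, and the code assumes the objects created in the steps above (the cleaning pipeline, dtm_train_freq, and model_laplace) are still in memory:

# Hypothetical new message to classify
new_text <- "Congratulations! You won a free prize. Text WIN to claim now"
new_corpus <- VCorpus(VectorSource(new_text))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removeNumbers)
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("english"))
new_corpus <- tm_map(new_corpus, stripWhitespace)
# Restrict the DTM to the training vocabulary so the columns line up
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = colnames(dtm_train_freq)))
# Apply the same Yes/No encoding used for training
new_data <- ifelse(as.matrix(new_dtm) > 0, "Yes", "No")
predict(model_laplace, new_data)  # most likely "spam" for a message like this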
Colab Link:
https://colab.research.google.com/drive/1wGItZ2Yv_LNGDUvqJjlMbV5Kl87ny_lT#scrollTo=i569RoYeEf_J
A spam classification system in R uses machine learning to detect unsolicited or harmful messages. It analyzes the content of text messages and classifies them as either "spam" or "ham" using algorithms like Naive Bayes. The process involves data cleaning, text preprocessing, feature extraction, and training a model to accurately identify spam messages.
The main tools and libraries used are:
1. R, the programming language for the whole workflow
2. tm, for text cleaning and preprocessing
3. e1071, which provides the Naive Bayes classifier
4. caret, for data splitting and model evaluation
5. stringr and dplyr, for string handling and data manipulation
Naive Bayes works well with text data and assumes independence between features (words), making it fast and effective for spam filtering tasks. It also performs well even with limited training data and handles high-dimensional data efficiently.
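As a toy illustration of that independence assumption (all probabilities below are made up for the example), Naive Bayes scores each class by multiplying the class prior with the per-word likelihoods and then comparing the scores:

# Toy Naive Bayes score for a message containing "free" and "win"
# (all probabilities are hypothetical, for illustration only)
p_spam <- 0.13; p_ham <- 0.87            # class priors, roughly our dataset's mix
like_spam <- 0.20 * 0.15                 # P(free|spam) * P(win|spam)
like_ham  <- 0.01 * 0.005                # P(free|ham)  * P(win|ham)
score_spam <- p_spam * like_spam
score_ham  <- p_ham  * like_ham
score_spam / (score_spam + score_ham)    # posterior probability of spam (~0.99)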
Yes. Besides Naive Bayes, you can experiment with:
1. Support Vector Machines (SVM)
2. Logistic Regression
3. Decision Trees and Random Forests
4. Neural Networks (for more complex filters)
If you enjoyed this project, here are some more beginner-friendly ML projects in R to explore: