Spam Filter Project Using R with Naive Bayes – With Code

By Rohit Sharma

Updated on Jul 25, 2025 | 10 min read | 1.22K+ views

In this Spam Filter Project Using R, we'll build a spam filter that classifies text messages as spam or ham (not spam) using the Naive Bayes algorithm.

This blog will explain the steps involved in this project, starting from loading and cleaning the dataset to training and evaluating the model.

This project will also teach concepts like text mining, natural language processing, and how to use R packages like tm, e1071, and caret.

Looking for More R Projects? Here Are the Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025 

What Tools and Libraries Will You Be Using?

These are the tools and libraries used in this project. From R itself to the Naive Bayes algorithm, each one helps ensure the Spam Filter Project runs smoothly.

Category | Tool / Library | Purpose
Programming Language | R | Main language used for data analysis and modeling
Text Mining | tm | For text preprocessing and cleaning
Machine Learning | e1071 | Implements the Naive Bayes classification algorithm
Model Evaluation | caret | Used for the training/testing split and model evaluation
Data Manipulation | dplyr | For efficient data handling and transformations
String Handling | stringr | To work with and manipulate text data
Dataset Format | CSV | Comma-separated values file containing labeled messages
Algorithm | Naive Bayes | A probabilistic classifier used for spam detection

What To Know Before Starting This Spam Filter Project?

Before starting this project, it's helpful to have a basic understanding of the following concepts:

  • Text Data Basics: You need to know what text data is and how it's different from numerical data.
  • R Programming: Being familiar with basic R syntax, working with data frames, and loading libraries helps.
  • Data Preprocessing: Knowing the importance of cleaning and transforming text (removing punctuation, stopwords, etc.) is a big plus.
  • Naive Bayes Algorithm: You need a basic idea of how probabilistic classification works using word frequencies (see the short sketch after this list).
  • R Libraries: A little experience with R packages like tm, e1071, and caret for text mining and modeling is very helpful.
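
To make the Naive Bayes idea concrete, here is a minimal sketch of how Bayes' rule turns word frequencies into a spam probability. The numbers are made up for illustration and are not taken from the project dataset:

# Minimal Bayes' rule sketch (made-up numbers, illustration only)
# Suppose "free" appears in 40% of spam messages and 2% of ham messages,
# and that 13% of all messages are spam.
p_word_given_spam <- 0.40
p_word_given_ham  <- 0.02
p_spam            <- 0.13
p_ham             <- 1 - p_spam

# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_word <- p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_word_given_spam * p_spam / p_word   # ~0.75, so "free" alone is strong evidence of spam

Naive Bayes simply multiplies such per-word evidence across all the words in a message, assuming the words are independent of each other.
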
What Are the Project Duration, Difficulty, and Skill Level Required?

Here’s a breakdown of the project duration, difficulty, and required skill level for this Spam Filter Project Using Naive Bayes:

Aspect | Details
Project Duration | 2 to 4 hours (depending on familiarity with R)
Difficulty Level | Beginner to Intermediate
Skill Level Required | Basic understanding of R, text data, and classification

Must Read: R For Data Science: Why Should You Choose R for Data Science?

Step-by-Step Explanation For This Spam Filter Project Using R and Naive Bayes

The following section will break down the step-by-step process of creating this project and running it in Colab.

Step 1: Setting Up Google Colab for R and Uploading the Dataset

Before we start building our spam filter, we need to set up the right tools. We’ll use Google Colab, a cloud-based platform that runs Python by default but can also run R once you switch the runtime. Download the CSV dataset from a platform like Kaggle, then follow these steps.

Here's how to get started:

1. Open Google Colab

2. Change Runtime to R:

  • Go to the top menu: Runtime > Change runtime type
  • In the "Runtime type" dropdown, select R
  • Click Save

Now you're ready to run R code inside Colab.
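
To confirm the runtime switched correctly, you can run a quick sanity check in the first cell (optional, not part of the project code):

# This should print the installed R version, e.g. "R version 4.x.x ..."
R.version.string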

Step 2: Installing Required R Libraries

After setting up Colab, we will install the R libraries required for this Spam Filter Project. These libraries are the tools that give us functions to clean text, build models, and evaluate results. You only need to install these once. The code for this step is:

# Install only once, skip this step if already installed in your session

install.packages("tm")        # Used for text mining and preprocessing
install.packages("e1071")     # Provides the Naive Bayes algorithm for classification
install.packages("caret")     # Helps with splitting data and evaluating model performance
install.packages("stringr")   # Useful for string operations like pattern matching
install.packages("dplyr")     # Makes it easier to filter, select, and manipulate data

Also Read: 10 Must-Try R Project Ideas for Beginners in 2025!

Step 3: Loading Libraries and Reading the Dataset

In this step, we’ll load the libraries into memory and read the dataset we downloaded. Use the following code:

# Load the required libraries into memory
library(tm)         # Text mining
library(e1071)      # Naive Bayes classification
library(caret)      # For data splitting and model evaluation
library(stringr)    # For string operations
library(dplyr)      # For data manipulation

# Read the CSV file containing spam/ham messages
data <- read.csv("SPAM text message 20170820 - Data.csv", stringsAsFactors = FALSE)

# Display the first few rows of the dataset to understand its structure
head(data)

The output for this section will be:

  Category Message
  <chr> <chr>
1 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 ham Ok lar... Joking wif u oni...
3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham U dun say so early hor... U c already then say...
5 ham Nah I don't think he goes to usf, he lives around here though
6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
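
Before moving on, it's worth confirming the dataset loaded as expected. These optional checks assume the same CSV as above:

# Optional sanity checks on the loaded data
str(data)    # should show two character columns: Category and Message
nrow(data)   # should be 5572 for this dataset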

Step 4: Exploring and Preparing the Dataset

In this step, we’ll take a closer look at the structure of our dataset and prepare it for modeling. We’ll check the column names, rename them to something easier to work with, convert the labels to a categorical format (factor), and finally inspect how many spam vs. ham messages the dataset contains. We’ll use the following code:

# Check the column names in the dataset
colnames(data)

# Rename the columns to "label" (spam/ham) and "text" (message content)
colnames(data) <- c("label", "text")

# Convert the label column to a factor so R knows it represents categories
data$label <- factor(data$label)

# Check how many spam and ham messages are in the dataset
table(data$label)

The output for the above step will be:

'Category' 'Message'

 ham spam 
4825  747
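
Notice that the classes are imbalanced: roughly 87% of messages are ham. You can check the proportions directly (optional):

# Optional: view the class balance as proportions
prop.table(table(data$label))
# ham ~0.866, spam ~0.134, so most messages are legitimate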

Read To Learn More: R vs Python Data Science: The Difference

Step 5: Cleaning the Text and Creating a Document-Term Matrix

In this step, we will convert all text to lowercase. We will remove numbers and punctuation, eliminate common words (like "the", "and"), and clean up extra spaces. After cleaning, we will convert the text into a format that a machine learning model can understand: a Document-Term Matrix (DTM), where rows are messages and columns are words. 

Use the code below for this step:

# Create a corpus (collection of all text messages)
corpus <- VCorpus(VectorSource(data$text))

# Clean the text in multiple steps using the tm_map function
corpus_clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>%      # Convert text to lowercase
  tm_map(removeNumbers) %>%                     # Remove numbers
  tm_map(removePunctuation) %>%                 # Remove punctuation marks
  tm_map(removeWords, stopwords("english")) %>% # Remove common English stopwords (e.g., "the", "and")
  tm_map(stripWhitespace)                       # Remove unnecessary white spaces

# Create a Document-Term Matrix: a table of word frequencies per message
dtm <- DocumentTermMatrix(corpus_clean)

# Show the dimensions of the matrix (number of messages × number of unique words)
dim(dtm)

The output for the above step would be:

5572 8305

This means that the Document-Term Matrix (DTM) has:

  • 5572 rows, where each row represents one text message (SMS).
  • 8305 columns, where each column represents a unique word (term) found in the dataset after cleaning. You can peek at a corner of this matrix, as shown below.
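
Because most messages contain only a handful of these 8305 words, the matrix is mostly zeros. Here's how to inspect a small corner of it (optional):

# Optional: inspect a small corner of the Document-Term Matrix
inspect(dtm[1:5, 1:10])   # first 5 messages x first 10 terms, mostly zeros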

Step 6: Splitting the Data and Preparing Features for Naive Bayes

Before training the model, we need to split the data into training and testing sets. We’ll use 80% of the messages to train the model and 20% to see how well it performs. 

# Split the data into 80% training and 20% testing
set.seed(123)  # Setting seed to make the split reproducible
train_index <- createDataPartition(data$label, p = 0.8, list = FALSE)

# Create training and testing Document-Term Matrices
dtm_train <- dtm[train_index, ]
dtm_test  <- dtm[-train_index, ]

# Split the labels (spam or ham)
train_labels <- data$label[train_index]
test_labels  <- data$label[-train_index]

# Remove sparse terms to reduce dimensions and keep only frequent words (appearing in at least 1% of messages)
dtm_train_freq <- removeSparseTerms(dtm_train, 0.99)

# Define a function to convert word counts into "Yes"/"No" (word present or not)
convert_counts <- function(x) {
  y <- ifelse(x > 0, "Yes", "No")          # If count > 0, mark "Yes"
  y <- factor(y, levels = c("No", "Yes"))  # Ensure consistent factor levels
  return(y)
}

# Apply the conversion function to training data
train_data <- apply(dtm_train_freq, MARGIN = 2, convert_counts)

# Apply the same transformation to test data using only training vocab
test_data <- apply(dtm_test[, colnames(dtm_train_freq)], MARGIN = 2, convert_counts)

After this step, the text data is now fully prepared for training the Naive Bayes model. It has been cleaned, reduced, and encoded as simple binary features.
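
You can verify how much the sparsity filter shrank the feature space, and that the test features use exactly the same columns as the training features (optional checks):

# Optional: compare dimensions before and after removing sparse terms
dim(dtm_train)        # full training vocabulary
dim(dtm_train_freq)   # reduced vocabulary (terms in roughly >= 1% of training messages)
all(colnames(test_data) == colnames(train_data))   # should print TRUE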

Read This to Learn The: Benefits of Learning R: Why It’s Essential for Data Science

Step 7: Training the Naive Bayes Model and Evaluating Performance

Now that the data is ready, we can train the Naive Bayes classifier on the training set. After training, we'll use the model to predict whether messages in the test set are spam or ham. Finally, we’ll check the model’s accuracy and view a confusion matrix to understand how well it performed.

Use the following code for this step:

# Train the Naive Bayes model using the training data
model <- naiveBayes(train_data, train_labels)

# Use the trained model to make predictions on the test data
predictions <- predict(model, test_data)

# Evaluate the model using a confusion matrix
confusion <- confusionMatrix(predictions, test_labels)

# Print the evaluation results (accuracy, precision, recall, etc.)
print(confusion)

The output for the above code will be:

Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  955   37
      spam  10  112

               Accuracy : 0.9578
                 95% CI : (0.9443, 0.9688)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8028

 Mcnemar's Test P-Value : 0.0001491

            Sensitivity : 0.9896
            Specificity : 0.7517
         Pos Pred Value : 0.9627
         Neg Pred Value : 0.9180
             Prevalence : 0.8662
         Detection Rate : 0.8573
   Detection Prevalence : 0.8905
      Balanced Accuracy : 0.8707

       'Positive' Class : ham

Key Metrics Of The Above Output:

Metric | Value | What it Means
Accuracy | 95.78% | The model correctly classified ~96% of all messages
Sensitivity | 98.96% | The model was excellent at identifying ham messages
Specificity | 75.17% | The model was moderately good at catching spam messages
Kappa | 0.80 | Shows strong agreement between predictions and actual labels
Balanced Accuracy | 87.07% | Averages sensitivity and specificity to balance performance on both classes
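
If you'd rather pull these numbers programmatically than read the printout, the object returned by confusionMatrix() exposes them directly:

# Extract individual metrics from the confusionMatrix object
confusion$overall["Accuracy"]                        # overall accuracy
confusion$byClass[c("Sensitivity", "Specificity")]   # per-class metrics (positive class: ham)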

Must Read: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 8: Improving the Model with Laplace Smoothing

In this step, we will try to improve our Naive Bayes model by adding Laplace smoothing. This technique avoids zero probabilities for words that appear in the test data but not in the training data; without it, a single unseen word would zero out the entire product of probabilities for a message.
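
To see why smoothing matters, here's a tiny numeric illustration with made-up counts (not taken from the project data):

# Made-up counts: estimating P(word | spam) for a word never seen in spam
count <- 0     # times the word appeared in spam training messages
total <- 500   # number of spam training messages
count / total              # 0: this would wipe out the whole product of probabilities
(count + 1) / (total + 2)  # ~0.002: Laplace smoothing keeps a small nonzero estimate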

Use this code for this step:

# Train a Naive Bayes model with Laplace smoothing to avoid zero probabilities
model_laplace <- naiveBayes(train_data, train_labels, laplace = 1)

# Make predictions on the test data using the new model
pred_laplace <- predict(model_laplace, test_data)

# Evaluate the model performance with a confusion matrix
confusionMatrix(pred_laplace, test_labels)

The output for the above code is:

Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  953   36
      spam  12  113

               Accuracy : 0.9569
                 95% CI : (0.9433, 0.9681)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8005

 Mcnemar's Test P-Value : 0.0009009

            Sensitivity : 0.9876
            Specificity : 0.7584
         Pos Pred Value : 0.9636
         Neg Pred Value : 0.9040
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8878
      Balanced Accuracy : 0.8730

       'Positive' Class : ham

Updated Performance Metrics:

Metric | Value | Change
Accuracy | 95.69% | Slightly lower
Sensitivity (Ham) | 98.76% | Slightly lower
Specificity (Spam) | 75.84% | Slightly higher
Kappa | 0.8005 | Slightly lower
Balanced Accuracy | 87.30% | Slightly improved

Adding Laplace smoothing made a small trade-off:

  • A very slight drop in ham accuracy
  • A small gain in identifying spam correctly (specificity)
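
With the model trained, you can also try it on a brand-new message. The sketch below reuses the objects built earlier (model_laplace, dtm_train_freq) and a made-up sample message; it mirrors the same cleaning and encoding steps:

# A sketch: classify a new, made-up message with the trained filter
new_text <- "Congratulations! You have won a free prize. Reply WIN to claim"

# Apply the same cleaning pipeline used for the training data
new_corpus <- VCorpus(VectorSource(new_text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)

# Restrict the new DTM to the training vocabulary so the columns line up
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = colnames(dtm_train_freq)))

# Convert counts to the same "Yes"/"No" features the model was trained on
new_features <- ifelse(as.matrix(new_dtm) > 0, "Yes", "No")

predict(model_laplace, new_features)   # should lean toward "spam" for a message like this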

Conclusion

In this Spam Filter project, we built a Naive Bayes classification model using R in Google Colab to detect whether a message is spam or not based on its text content.

We began by cleaning and preprocessing the text data, created a Document-Term Matrix, and then trained the model on 80% of the messages while testing on the remaining 20%. The model's performance was evaluated using a confusion matrix, showing an accuracy of 95.78%.

We further applied Laplace smoothing to improve robustness, which slightly boosted the model’s ability to handle unseen words while maintaining high accuracy and balanced performance.

Colab Link:
https://colab.research.google.com/drive/1wGItZ2Yv_LNGDUvqJjlMbV5Kl87ny_lT#scrollTo=i569RoYeEf_J

Frequently Asked Questions (FAQs)

1. How does a spam text classification system work using R?

2. Which tools and R libraries are used in this Spam Filter project?

3. Why is Naive Bayes suitable for spam detection?

4. Can I use other algorithms to improve the spam filter?

5. What are some similar beginner-friendly machine learning projects in R?

Rohit Sharma

803 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
