In this Spam Filter Project Using R, we'll build a spam filter that classifies text messages as spam or ham using the Naive Bayes algorithm.
This blog will explain the steps involved in this project, starting from loading and cleaning the dataset to training and evaluating the model.
This project will also teach concepts like text mining, natural language processing, and how to use R packages like tm, e1071, and caret.
These are the tools and libraries that will be used in this project. Alongside R itself, we'll rely on the tm, e1071, caret, stringr, and dplyr packages, and on Naive Bayes, a probabilistic classifier widely used for spam detection.
Step-by-Step Explanation For This Spam Filter Project Using R and Naive Bayes
The following section will break down the step-by-step process of creating this project and running it in Colab.
Step 1: Setting Up Google Colab for R and Uploading the Dataset
Before we start creating our spam filter project, we need to set up the right tools. We’ll use Google Colab, which is a cloud-based platform that usually runs Python, but we can also run R code by switching the runtime. Download the CSV dataset from platforms like Kaggle and follow these steps.
Here's how to get started:
1. Open Google Colab
2. Change Runtime to R:
Go to the top menu: Runtime > Change runtime type
In the "Runtime type" dropdown, select R
Click Save
Now you're ready to run R code inside Colab.
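To confirm the switch worked, you can run a quick one-line check in a code cell; if the runtime is R, it prints the version string instead of a Python error.
# If the runtime was switched correctly, this prints the installed R version
R.version.string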
Step 2: Installing Required R Libraries
After setting up Colab, we will install the R libraries required for this Spam Filter Project. These libraries are the tools that give us functions to clean text, build models, and evaluate results. You only need to install these once. The code for this step is:
# Install only once, skip this step if already installed in your session
install.packages("tm") # Used for text mining and preprocessing
install.packages("e1071") # Provides the Naive Bayes algorithm for classification
install.packages("caret") # Helps with splitting data and evaluating model performance
install.packages("stringr") # Useful for string operations like pattern matching
install.packages("dplyr") # Makes it easier to filter, select, and manipulate data
Step 3: Loading the Libraries and Reading the Dataset
In this step, we'll load the libraries we just installed and read the dataset we downloaded. Use the following code for this step.
# Load the required libraries into memory
library(tm) # Text mining
library(e1071) # Naive Bayes classification
library(caret) # For data splitting and model evaluation
library(stringr) # For string operations
library(dplyr) # For data manipulation
# Read the CSV file containing spam/ham messages
data <- read.csv("SPAM text message 20170820 - Data.csv", stringsAsFactors = FALSE)
# Display the first few rows of the dataset to understand its structure
head(data)
The output for this section will be:
  Category Message
1 ham      Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 ham      Ok lar... Joking wif u oni...
3 spam     Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham      U dun say so early hor... U c already then say...
5 ham      Nah I don't think he goes to usf, he lives around here though
6 spam     FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
Step 4: Exploring and Preparing the Dataset
In this step, we'll take a closer look at the structure of our dataset and prepare it for modeling. We will check the column names, rename them to something easier to work with, convert the labels to a categorical format (factor), and finally inspect how many spam vs. ham messages the dataset contains. We'll use the following code:
# Check the column names in the dataset
colnames(data)
# Rename the columns to "label" (spam/ham) and "text" (message content)
colnames(data) <- c("label", "text")
# Convert the label column to a factor so R knows it represents categories
data$label <- factor(data$label)
# Check how many spam and ham messages are in the dataset
table(data$label)
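If you prefer percentages over raw counts, prop.table converts the table to proportions in one line; the dataset is heavily tilted toward ham, which matters later when reading the confusion matrix.
# Optional: view the class balance as percentages instead of raw counts
round(prop.table(table(data$label)) * 100, 2)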
Step 5: Cleaning the Text and Creating a Document-Term Matrix
In this step, we will convert all text to lowercase. We will remove numbers and punctuation, eliminate common words (like "the", "and"), and clean up extra spaces. After cleaning, we will convert the text into a format that a machine learning model can understand: a Document-Term Matrix (DTM), where rows are messages and columns are words.
Use the code below for this step:
# Create a corpus (collection of all text messages)
corpus <- VCorpus(VectorSource(data$text))
# Clean the text in multiple steps using the tm_map function
corpus_clean <- corpus %>%
tm_map(content_transformer(tolower)) %>% # Convert text to lowercase
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removePunctuation) %>% # Remove punctuation marks
tm_map(removeWords, stopwords("english")) %>% # Remove common English stopwords (e.g., "the", "and")
tm_map(stripWhitespace) # Remove unnecessary white spaces
# Create a Document-Term Matrix: a table of word frequencies per message
dtm <- DocumentTermMatrix(corpus_clean)
# Show the dimensions of the matrix (number of messages × number of unique words)
dim(dtm)
The output for the above step would be:
5572 8305
This means that the Document-Term Matrix (DTM) has:
5572 rows, where each row represents one text message (SMS).
8305 columns, where each column represents a unique word (term) found in the dataset after cleaning.
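To get a feel for what the matrix actually contains, you can peek at a small corner of it (the exact terms shown depend on the cleaned vocabulary):
# Inspect the first 5 messages against the first 10 terms of the DTM
inspect(dtm[1:5, 1:10])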
Step 6: Splitting the Data and Preparing Features for Naive Bayes
Before training the model, we need to split the data into training and testing sets. We’ll use 80% of the messages to train the model and 20% to see how well it performs.
# Split the data into 80% training and 20% testing
set.seed(123) # Setting seed to make the split reproducible
train_index <- createDataPartition(data$label, p = 0.8, list = FALSE)
# Create training and testing Document-Term Matrices
dtm_train <- dtm[train_index, ]
dtm_test <- dtm[-train_index, ]
# Split the labels (spam or ham)
train_labels <- data$label[train_index]
test_labels <- data$label[-train_index]
# Remove sparse terms to reduce dimensions and keep only frequent words (appearing in at least 1% of messages)
dtm_train_freq <- removeSparseTerms(dtm_train, 0.99)
# Define a function to convert word counts into "Yes"/"No" (word present or not)
convert_counts <- function(x) {
y <- ifelse(x > 0, "Yes", "No") # If count > 0, mark "Yes"
y <- factor(y, levels = c("No", "Yes")) # Ensure consistent factor levels
return(y)
}
# Apply the conversion function to training data
train_data <- apply(dtm_train_freq, MARGIN = 2, convert_counts)
# Apply the same transformation to test data using only training vocab
test_data <- apply(dtm_test[, colnames(dtm_train_freq)], MARGIN = 2, convert_counts)
After this step, the text data is now fully prepared for training the Naive Bayes model. It has been cleaned, reduced, and encoded as simple binary features.
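A quick sanity check at this point confirms the shapes of the feature matrices and the label balance in the training split:
# Sanity check: feature matrix shapes and training label balance
dim(train_data)     # training messages x retained words
dim(test_data)      # test messages x the same words
table(train_labels) # ham/spam counts in the training split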
Step 7: Training the Naive Bayes Model and Evaluating Performance
Now that the data is ready, we can train the Naive Bayes classifier on the training set. After training, we'll use the model to predict whether messages in the test set are spam or ham. Finally, we'll check the model's accuracy and view a confusion matrix to understand how well it performed.
Use this code for this step:
# Train the Naive Bayes model using the training data
model <- naiveBayes(train_data, train_labels)
# Use the trained model to make predictions on the test data
predictions <- predict(model, test_data)
# Evaluate the model using a confusion matrix
confusion <- confusionMatrix(predictions, test_labels)
# Print the evaluation results (accuracy, precision, recall, etc.)
print(confusion)
The output for the above code will be:
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  955   37
      spam  10  112

               Accuracy : 0.9578
                 95% CI : (0.9443, 0.9688)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8028

 Mcnemar's Test P-Value : 0.0001491

            Sensitivity : 0.9896
            Specificity : 0.7517
         Pos Pred Value : 0.9627
         Neg Pred Value : 0.9180
             Prevalence : 0.8662
         Detection Rate : 0.8573
   Detection Prevalence : 0.8905
      Balanced Accuracy : 0.8707

       'Positive' Class : ham
Key Metrics Of The Above Output:

Metric              Value    What it Means
Accuracy            95.78%   The model correctly classified ~96% of all messages
Sensitivity         98.96%   The model was excellent at identifying ham messages
Specificity         75.17%   The model was moderately good at catching spam messages
Kappa               0.80     Shows strong agreement between predictions and actual labels
Balanced Accuracy   87.07%   Averages sensitivity and specificity to balance performance on both classes
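Before tweaking the model further, here's a minimal sketch of using it on a brand-new message. The message text is invented for illustration, and the sketch assumes the objects from the earlier steps are still in memory. The key detail is that the new text must go through the same cleaning steps and be restricted to the training vocabulary so the features line up with what the model saw.
# Hypothetical example message, invented for illustration
new_text <- "Congratulations! You have won a free prize. Reply WIN now!"
# Run the new text through the same cleaning pipeline as the training data
new_corpus <- VCorpus(VectorSource(new_text))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removeNumbers)
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("english"))
new_corpus <- tm_map(new_corpus, stripWhitespace)
# Restrict the new DTM to the training vocabulary so the columns line up
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = colnames(dtm_train_freq)))
# Apply the same Yes/No presence encoding used during training
new_features <- ifelse(as.matrix(new_dtm) > 0, "Yes", "No")
predict(model, new_features)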
Step 8: Improving the Model with Laplace Smoothing
In this step, we will try to improve our Naive Bayes model by adding Laplace smoothing. This technique helps avoid zero probabilities for words that don’t appear in the training data but do appear in the test data.
Use this code for this step:
# Train a Naive Bayes model with Laplace smoothing to avoid zero probabilities
model_laplace <- naiveBayes(train_data, train_labels, laplace = 1)
# Make predictions on the test data using the new model
pred_laplace <- predict(model_laplace, test_data)
# Evaluate the model performance with a confusion matrix
confusionMatrix(pred_laplace, test_labels)
The output for the above code is:
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  953   36
      spam  12  113

               Accuracy : 0.9569
                 95% CI : (0.9433, 0.9681)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8005

 Mcnemar's Test P-Value : 0.0009009

            Sensitivity : 0.9876
            Specificity : 0.7584
         Pos Pred Value : 0.9636
         Neg Pred Value : 0.9040
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8878
      Balanced Accuracy : 0.8730

       'Positive' Class : ham
Updated Performance Metrics:

Metric              Value    Change
Accuracy            95.69%   Slightly lower
Sensitivity (Ham)   98.76%   Slightly lower
Specificity (Spam)  75.84%   Slightly higher
Kappa               0.8005   Slightly lower
Balanced Accuracy   87.30%   Slightly improved
Adding Laplace smoothing made a small trade-off:
A very slight drop in ham accuracy
A small gain in identifying spam correctly (specificity)
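If you want to see whether a different smoothing strength helps, a small sweep like the sketch below compares a few laplace values on the same split (the values chosen here are arbitrary examples, and the sketch assumes the train/test objects from earlier steps are still in memory):
# Illustrative sweep over a few Laplace smoothing values
for (lp in c(0, 0.5, 1, 2)) {
  m   <- naiveBayes(train_data, train_labels, laplace = lp)
  acc <- mean(predict(m, test_data) == test_labels)
  cat("laplace =", lp, "-> accuracy =", round(acc, 4), "\n")
}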
Conclusion
In this Spam Filter project, we built a Naive Bayes classification model using R in Google Colab to detect whether a message is spam or not based on its text content.
We began by cleaning and preprocessing the text data, created a Document-Term Matrix, and then trained the model on 80% of the messages while testing on the remaining 20%. The model's performance was evaluated using a confusion matrix, showing an accuracy of 95.78%.
We further applied Laplace smoothing to improve robustness, which slightly boosted the model’s ability to handle unseen words while maintaining high accuracy and balanced performance.
1. How does a spam text classification system work using R?
A spam classification system in R uses machine learning to detect unsolicited or harmful messages. It analyzes the content of text messages and classifies them as either "spam" or "ham" using algorithms like Naive Bayes. The process involves data cleaning, text preprocessing, feature extraction, and training a model to accurately identify spam messages.
2. Which tools and R libraries are used in this Spam Filter project?
The main tools and libraries used are:
Google Colab (via R kernel or Jupyter)
R libraries: tm for text mining, e1071 for Naive Bayes, caret for evaluation, stringr and dplyr for data cleaning and manipulation.
3. Why is Naive Bayes suitable for spam detection?
Naive Bayes works well with text data and assumes independence between features (words), making it fast and effective for spam filtering tasks. It also performs well even with limited training data and handles high-dimensional data efficiently.
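In rough terms, the classifier scores each class with Bayes' rule under that independence assumption and picks the class with the higher score:

P(spam | w1, ..., wn) ∝ P(spam) × P(w1 | spam) × ... × P(wn | spam)

where w1, ..., wn are the words in the message; the same product is computed for ham.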
4. Can I use other algorithms to improve the spam filter?
Yes. Besides Naive Bayes, you can experiment with the following (a quick sketch of one alternative appears after this list):
Support Vector Machines (SVM)
Random Forest
Logistic Regression
Gradient Boosting (e.g., XGBoost)
Neural Networks (for more complex filters)
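As one example, here's a minimal sketch of trying a linear SVM with the e1071 package that's already installed for this project. It assumes the train/test objects from the main walkthrough are in memory, and the variable names introduced here are for illustration only.
# Minimal sketch: linear SVM as an alternative to Naive Bayes.
# SVMs need numeric features, so use 0/1 word presence instead of "Yes"/"No".
train_num <- (as.matrix(dtm_train_freq) > 0) * 1
test_num  <- (as.matrix(dtm_test[, colnames(dtm_train_freq)]) > 0) * 1
svm_model <- svm(x = train_num, y = train_labels, kernel = "linear")
svm_pred  <- predict(svm_model, test_num)
mean(svm_pred == test_labels)  # rough accuracy check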
5. What are some similar beginner-friendly machine learning projects in R?
If you enjoyed this project, a few classic beginner-friendly ML projects in R to explore next include iris flower classification, Titanic survival prediction, movie review sentiment analysis, and house price prediction.