Email Classification Using Machine Learning and NLP Techniques

Updated on Aug 05, 2025

Table of Contents

What Should You Know Beforehand?
Technologies and Libraries Used
Models That Will Be Utilized for Learning
Time Taken and Difficulty Level
How to Build the Email Classification Model
Conclusion

Our inboxes are overflowing with emails every day. Some are promotional, some are important, and many are just plain spam. In addition to being time-consuming, manually sorting through them is dangerous because spam emails frequently contain malware or phishing links. This is where machine learning-based email classification is useful. It automates the process of labeling emails as spam or not spam.

We will develop a machine learning model in this project that can reliably categorize emails according to their content. To make sure our model is trained on a variety of email types, we will use three real-world datasets: SpamAssassin, Enron Spam Subset, and LingSpam.

You will have a better understanding of the inner workings of spam filters by the end of this project. Additionally, you will get practical experience with Python classification models, text vectorization, and natural language processing (NLP).

What Should You Know Beforehand?

It is better to have at least some background in:

Python programming - functions, loops, libraries.
Pandas and NumPy - data loading, cleaning, and manipulation.
Text data basics - how text is processed (for example, removing stopwords or punctuation).
Machine learning basics - how classification works, especially supervised learning with labeled data.
Train-test split and evaluation metrics - accuracy, precision, recall, and F1-score.

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool / Library	Purpose
Python	Core programming language for this project
Pandas	Data loading, cleaning, and manipulation
NumPy	Numerical operations and array handling
Scikit-learn (sklearn)	Building ML models, data splitting, and evaluation metrics
CountVectorizer / TfidfVectorizer	Converting email text into numerical features (vectorization)
Matplotlib / Seaborn	(Optional) Visualizing model performance using plots and charts

Models That Will Be Utilized for Learning

To build an email classification project, we will use the following machine learning models.

Model	Why to Use?
Multinomial Naive Bayes	Well-suited for text classification tasks, such as spam detection. Fast & effective.
Logistic Regression	A strong baseline model for binary classification with interpretable results.
Decision Tree Classifier	Captures decision rules from the data. Easy to comprehend and visualize.
Random Forest Classifier	An ensemble method that improves accuracy by combining multiple decision trees.

Time Taken and Difficulty Level

On average, this project takes about 1 to 2 hours to complete, depending on your familiarity with Python and ML concepts. It is best suited for beginners.

How to Build the Email Classification Model

Let’s build the project from scratch. The workflow has five steps:

Loading and combining the three email datasets
Preprocessing the data (text) for machine learning
Converting emails into numeric features using vectorization
Training multiple classification models
Evaluating and comparing their performance

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the email classification model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

Open a new tab in any web browser.
Go to https://www.kaggle.com/code/rohitshirudkar/email-classification-spam-or-ham/input.
On the Email Classification | Spam or Ham page, in the right pane, under the Input section, click completeSpamAssassin.csv.
Click the download icon.
Click enronSpamSubset.csv.
Click the download icon.
Click lingSpam.csv.
Click the download icon.
Unzip all the files.

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv files have been downloaded, let’s upload them to the Colab environment. Use the following code to open a file picker and load the files.

from google.colab import files

uploaded = files.upload()

Once uploaded, import them into Pandas. Here’s the code to do so:

import pandas as pd

# Load each dataset
df1 = pd.read_csv('completeSpamAssassin.csv')
df2 = pd.read_csv('enronSpamSubset.csv')
df3 = pd.read_csv('lingSpam.csv')

# Check shape and preview each
print("SpamAssassin:", df1.shape)
print("Enron:", df2.shape)
print("LingSpam:", df3.shape)

# Optional: display a few rows
df1.head()

Output:

SpamAssassin: (6046, 3)
Enron: (10000, 4)
LingSpam: (2605, 3)

	Unnamed: 0	Body	Label
0	0	\nSave up to 70% on Life Insurance.\nWhy Spend...	1
1	1	1) Fight The Risk of Cancer!\nhttp://www.adcli...	1
2	2	1) Fight The Risk of Cancer!\nhttp://www.adcli...	1
3	3	##############################################...	1
4	4	I thought you might like these:\n1) Slim Down ...	1

What does the output tell us?

completeSpamAssassin.csv has 6,046 rows and 3 columns
enronSpamSubset.csv has 10,000 rows and 4 columns
lingSpam.csv has 2,605 rows and 3 columns
The column layout is not identical across files (the Enron file has an extra column).

Because the layouts differ, we will standardize them before merging. From each dataset we keep only the email content and the label, and rename those two columns to text and label. Here is the code to do so:

# For SpamAssassin (df1)
df1 = df1[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

# For Enron (df2) -- assume 'Body' is the email content
df2 = df2[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

# For LingSpam (df3)
df3 = df3[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

Now let’s combine them into a single dataset, using the code given below:

# Combine all three datasets
df = pd.concat([df1, df2, df3], ignore_index=True)

# Preview the merged data
df.head()
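
The select-rename-concat pattern used above can be tried on small in-memory frames before running it on the real files. The sketch below is only an illustration: the rows, the extra_col column, and the standardize helper are hypothetical and are not part of the Kaggle data.

```python
import pandas as pd

# Toy stand-ins for two source files (hypothetical rows, not real data).
toy_a = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Body": ["win cash now", "hello team"],
    "Label": [1, 0],
})
# The second frame has one extra column, like a file with a different layout.
toy_b = pd.DataFrame({
    "extra_col": [0],
    "Unnamed: 0": [0],
    "Body": ["free offer inside"],
    "Label": [1],
})

def standardize(frame):
    """Keep only the email text and label, under shared column names."""
    return frame[["Body", "Label"]].rename(columns={"Body": "text", "Label": "label"})

# ignore_index=True rebuilds a clean 0..n-1 index for the merged frame.
toy = pd.concat([standardize(toy_a), standardize(toy_b)], ignore_index=True)
print(toy.columns.tolist())
print(len(toy))
```

Selecting the two columns by name drops any extra index-like columns, so frames with different layouts end up with the same schema before they are concatenated.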

Check the structure using the code given below:

df.info()
df['label'].value_counts()

Output:

RangeIndex: 18651 entries, 0 to 18650
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    18650 non-null  object
 1   label   18651 non-null  int64
dtypes: int64(1), object(1)
memory usage: 291.6+ KB

label	count
0	11322
1	7329

dtype: int64

What does the output mean?

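The output shows the dataset after merging. It tells us that:

There are 18,651 emails in total.
11,322 are ham (0), about 61% of the data.
7,329 are spam (1), about 39% of the data.
The text column has 18,650 non-null values, so one email has no body. The preprocessing function in Step 3 converts every value to a string, so this row does not cause an error.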
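
As a quick check of the numbers above, the class shares can be computed from the reported counts. The second half of the sketch shows one way to count missing and repeated texts; it uses a hypothetical toy frame, since the first rows previewed earlier look as if they may contain repeated emails, but the real duplicate count is not reported in this article.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 11322, 1: 7329}, name="count")
share = (counts / counts.sum()).round(3)
print(share)

# Toy frame (hypothetical rows) for counting missing and duplicated texts.
toy = pd.DataFrame({
    "text": ["win cash now", "win cash now", None, "see you at lunch"],
    "label": [1, 1, 0, 0],
})
n_missing = int(toy["text"].isna().sum())
n_duplicates = int(toy.duplicated(subset="text").sum())
print(n_missing, n_duplicates)
```

On the merged dataset, the same two calls would be applied to df instead of toy.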

Step 3: Preprocess the Data

Before feeding the text to any machine learning model, we need to clean it up. Raw email content often includes unnecessary characters, URLs, numbers, etc. All this can reduce model performance.

To fix this, let’s define a preprocessing function and apply it to the text column. Use the code given below to accomplish the same:

import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Set of English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean each email
def preprocess_text(text):
    # Missing values arrive as float NaN; str() turns them into the token "nan"
    text = str(text).lower()
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Delete punctuation, then digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    # split() also discards newlines and repeated whitespace
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

# Apply to DataFrame
df['clean_text'] = df['text'].apply(preprocess_text)

Now, let's quickly compare raw vs cleaned text side by side. We will ensure that the preprocessing has removed:

Newlines (\n)
URLs
Numbers
Punctuation
Stopwords
Uppercase letters

Use the code to do this:

df[['text', 'clean_text']].head()

Output:

	text	clean_text
0	\nSave up to 70% on Life Insurance.\nWhy Spend...	save life insurance spend tolife quote savings...
1	1) Fight The Risk of Cancer!\nhttp://www.adcli...	fight risk cancer slim guaranteed lose lbs day...
2	1) Fight The Risk of Cancer!\nhttp://www.adcli...	fight risk cancer slim guaranteed lose lbs day...
3	##############################################...	adult club offers free membership instant acce...
4	I thought you might like these:\n1) Slim Down ...	thought might like slim guaranteed lose lbs da...

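From the output, we can see that preprocessing removed the elements listed above. One side effect is visible in row 0: because punctuation is deleted rather than replaced with a space, words separated only by a punctuation mark are fused together (for example, "tolife"). This is acceptable for a baseline, so let’s move ahead.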
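
If fused tokens become a problem, one possible variation is to map each punctuation character to a space instead of deleting it. The sketch below is an assumption-laden illustration rather than the code used in this project: preprocess_with_spaces is a hypothetical name, and the tiny STOP_WORDS set stands in for the NLTK list so that the example runs without a download.

```python
import re
import string

# Hypothetical mini stopword list; the project itself uses NLTK's English list.
STOP_WORDS = {"why", "more", "than", "you", "have", "to"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess_with_spaces(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", " ", text)   # drop URLs
    text = re.sub(r"<.*?>", " ", text)            # drop HTML tags
    text = text.translate(PUNCT_TO_SPACE)         # punctuation becomes whitespace
    text = re.sub(r"\d+", " ", text)              # drop digits
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "Why Spend More Than You Have To?Life Quote Savings"
cleaned = preprocess_with_spaces(sample)
print(cleaned)
```

Switching to this variant would change the vocabulary, so the scores reported later in this article would no longer match exactly.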

Step 4: Vectorize the Cleaned Text

Machine learning models can’t understand text directly. They only work with numbers. Therefore, in this step we will convert clean_text into numerical vectors using the TF-IDF vectorizer, which gives a word more weight when it is frequent in one email but rare across the whole collection. Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer; max_df=0.7 ignores words that appear in more than 70% of emails
vectorizer = TfidfVectorizer(max_df=0.7)

# Fit and transform the clean_text column
X = vectorizer.fit_transform(df['clean_text'])

# Labels (spam or not)
y = df['label']

Step 5: Train-Test Split

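Now that we have numerical data, let’s split it into training and test sets. Note that the vectorizer in Step 4 was fitted on the full dataset, so its vocabulary and IDF weights were computed with the test emails included. For a stricter evaluation, split the data first and fit the vectorizer on the training portion only. Use the code given below to perform the split: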

from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
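
A stricter variant splits the raw text first and lets a scikit-learn Pipeline fit the vectorizer on the training portion only. The sketch below is an alternative arrangement, not the code that produced the results in this article; the ten example texts are hypothetical, and stratify is added here so that both classes appear in the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical cleaned emails: five spam (1) and five ham (0).
texts = [
    "win free prize now", "free cash offer click here",
    "claim your free prize money", "cheap offer win cash",
    "exclusive prize claim now",
    "meeting agenda attached", "project report review",
    "lunch meeting tomorrow", "please review attached report",
    "project status update tomorrow",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text before any vectorization takes place.
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the model.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.7)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_y)

# predict() reuses the fitted vocabulary; test texts never influence it.
preds = pipe.predict(test_texts)
print(len(preds))
```

With this arrangement, words that occur only in the test emails are simply ignored at prediction time, which is closer to how a deployed filter would see new mail.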

Step 6: Train and Evaluate Models

In this step, we will train four popular classifiers in a single loop and evaluate each one on the test set, so that we can compare their performance on this binary classification problem.

Here is the code:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Dictionary of models to evaluate
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()
}

# Store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Compute accuracy once so the printed and stored values always match
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\n {name}")
    print("Accuracy:", accuracy)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

    results[name] = accuracy

Output:

Multinomial Naive Bayes
Accuracy: 0.9404985258643795
Confusion Matrix:
 [[2253   42]
 [ 180 1256]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95      2295
           1       0.97      0.87      0.92      1436

    accuracy                           0.94      3731
   macro avg       0.95      0.93      0.94      3731
weighted avg       0.94      0.94      0.94      3731

Logistic Regression
Accuracy: 0.9654248190833556
Confusion Matrix:
 [[2223   72]
 [  57 1379]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97      2295
           1       0.95      0.96      0.96      1436

    accuracy                           0.97      3731
   macro avg       0.96      0.96      0.96      3731
weighted avg       0.97      0.97      0.97      3731

Decision Tree Classifier
Accuracy: 0.9179844545698205
Confusion Matrix:
 [[2097  198]
 [ 108 1328]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.91      0.93      2295
           1       0.87      0.92      0.90      1436

    accuracy                           0.92      3731
   macro avg       0.91      0.92      0.91      3731
weighted avg       0.92      0.92      0.92      3731

Random Forest Classifier
Accuracy: 0.9664969177164299
Confusion Matrix:
 [[2211   84]
 [  41 1395]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      2295
           1       0.94      0.97      0.96      1436

    accuracy                           0.97      3731
   macro avg       0.96      0.97      0.96      3731
weighted avg       0.97      0.97      0.97      3731
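
The results dictionary filled in by the loop can be turned into a ranked summary. The sketch below re-creates that dictionary from the accuracy values printed above so that it runs on its own; in the notebook, the existing results variable could be used directly.

```python
import pandas as pd

# Accuracy values copied from the output above.
results = {
    "Multinomial Naive Bayes": 0.9404985258643795,
    "Logistic Regression": 0.9654248190833556,
    "Decision Tree Classifier": 0.9179844545698205,
    "Random Forest Classifier": 0.9664969177164299,
}

# Rank the models from best to worst accuracy.
ranking = pd.Series(results, name="accuracy").sort_values(ascending=False)
print((ranking * 100).round(2))

# Express the gap between the top two models as a number of test emails.
test_size = 3731
gap_in_emails = int(round(float(ranking.iloc[0] - ranking.iloc[1]) * test_size))
print(gap_in_emails)
```

Converting the accuracy gap into a count of emails makes it easier to judge whether a difference between two models is large enough to matter.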

Step 7: Final Model Comparison

Let’s compare side by side how each model performed. This comparison will also help you understand the above output.

Model	Accuracy	Precision (Class 1)	Recall (Class 1)	F1-Score (Class 1)	Remarks
Multinomial Naive Bayes	94.05%	0.97	0.87	0.92	Fast and lightweight; lowest spam recall (180 spam emails missed)
Logistic Regression	96.54%	0.95	0.96	0.96	High accuracy with balanced precision and recall
Decision Tree Classifier	91.80%	0.87	0.92	0.90	Lowest accuracy; a single unpruned tree tends to overfit
Random Forest Classifier	96.65%	0.94	0.97	0.96	Highest accuracy and recall, by a narrow margin over Logistic Regression
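
The class 1 columns in this table can be derived by hand from the confusion matrices printed in Step 6. The sketch below copies those matrices and recomputes the metrics; spam_metrics is a hypothetical helper name. In scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrices copied from the Step 6 output.
confusion = {
    "Multinomial Naive Bayes": np.array([[2253, 42], [180, 1256]]),
    "Logistic Regression": np.array([[2223, 72], [57, 1379]]),
    "Decision Tree Classifier": np.array([[2097, 198], [108, 1328]]),
    "Random Forest Classifier": np.array([[2211, 84], [41, 1395]]),
}

def spam_metrics(cm):
    """Return accuracy plus precision, recall, and F1 for the spam class (1)."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam emails, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / cm.sum()
    return accuracy, precision, recall, f1

metrics = {name: spam_metrics(cm) for name, cm in confusion.items()}
for name, (acc, prec, rec, f1) in metrics.items():
    print(f"{name}: acc={acc:.4f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Reading the matrices this way also shows the kind of error each model makes: the top-right cell counts legitimate emails flagged as spam, and the bottom-left cell counts spam that slipped through.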

Conclusion

In this project, we built an email classification system that detects spam and non-spam (ham) emails using machine learning. We trained and tested four models: Multinomial Naive Bayes, Logistic Regression, Decision Tree, and Random Forest.

Random Forest Classifier gave the best results on this test set, with 96.65% accuracy, 0.94 precision, and the highest recall of 0.97. Logistic Regression was close behind at 96.54% accuracy, a difference of only 4 emails out of 3,731, and it produced fewer false positives (72 ham emails flagged as spam versus 84). Multinomial Naive Bayes was fast and simple but had the lowest spam recall (0.87). Decision Tree Classifier showed the weakest performance, which is consistent with the tendency of a single unpruned tree to overfit.

These results suggest that Random Forest and Logistic Regression are both strong choices for spam detection on these datasets. Random Forest has a slight edge in recall, while Logistic Regression is easier to interpret and misclassified fewer legitimate emails.
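
Once a model is trained, classifying a new email means passing it through the same steps: clean the text, transform it with the already fitted vectorizer (transform, not fit_transform), and call predict. The sketch below shows this flow on a hypothetical eight-email training set with a Multinomial Naive Bayes model; the texts and the classify helper are illustrative assumptions, and the cleaning step is skipped because the toy strings are already clean.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical cleaned training emails: four spam (1), four ham (0).
train_texts = [
    "win free prize now", "free cash offer click",
    "claim free prize money", "cheap offer win cash",
    "meeting agenda attached", "project report review",
    "lunch meeting tomorrow", "please review attached report",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

def classify(emails):
    """Label new emails with the fitted vectorizer and model."""
    # transform() reuses the training vocabulary; it must not be refitted here.
    features = toy_vectorizer.transform(emails)
    return toy_model.predict(features).tolist()

predictions = classify(["free prize offer", "project meeting report"])
print(predictions)
```

In the project itself, the new email would first go through preprocess_text, and the vectorizer and model would be the ones fitted in Steps 4 and 6.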

Colab Link:
https://colab.research.google.com/drive/17qNg0jh-jlozEEkDobdeR_iCO5UIzuuJ?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the purpose of Email Classification?

Email Classification helps automatically organize emails by labeling them as spam, primary, promotional, or important using text analysis and machine learning algorithms.

2. How does Email Classification work using NLP?

It uses Natural Language Processing to analyze the content of emails, extract text features, and apply machine learning models to classify emails based on learned patterns.

3. What machine learning models are commonly used?

Naive Bayes is widely used due to its effectiveness with text data. Other models include Random Forests, Logistic Regression, and modern deep learning models like BERT.

4. What data is needed to build an email classification system?

Labeled datasets with examples of different email types (e.g., spam and non-spam) are essential. The Enron email dataset and SpamAssassin are commonly used for training.

5. What are the biggest challenges in Email Classification?

Challenges include handling unstructured text, ambiguous language, constantly evolving spam tactics, and achieving high accuracy while minimizing false positives.

Rohit Sharma

840 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
