Email Classification Using Machine Learning and NLP Techniques
By Rohit Sharma
Updated on Jul 30, 2025 | 1.28K+ views
Our inboxes are overflowing with emails every day. Some are promotional, some are important, and many are just plain spam. In addition to being time-consuming, manually sorting through them is dangerous because spam emails frequently contain malware or phishing links. This is where machine learning-based email classification is useful. It automates the process of labeling emails as spam or not spam.
We will develop a machine learning model in this project that can reliably categorize emails according to their content. To make sure our model is trained on a variety of email types, we will use three real-world datasets: SpamAssassin, Enron Spam Subset, and LingSpam.
You will have a better understanding of the inner workings of spam filters by the end of this project. Additionally, you will get practical experience with Python classification models, text vectorization, and natural language processing (NLP).
It is better to have at least some background in:
For this project, the following tools and libraries will be used:
Tool / Library | Purpose
Python | Core programming language for this project
Pandas | Data loading, cleaning, and manipulation
NumPy | Numerical operations and array handling
Scikit-learn (sklearn) | Building ML models, data splitting, and evaluation metrics
CountVectorizer / TfidfVectorizer | Converting email text into numerical features (vectorization)
Matplotlib / Seaborn | (Optional) Visualizing model performance using plots and charts
To build the email classification project, we will use the following machine learning models.
Model | Why Use It?
Multinomial Naive Bayes | Well-suited for text classification tasks, such as spam detection. Fast and effective.
Logistic Regression | A strong baseline model for binary classification with interpretable results.
Decision Tree Classifier | Captures decision rules from the data. Easy to comprehend and visualize.
Random Forest Classifier | An ensemble method that improves accuracy by combining multiple decision trees.
On average, this project takes about 1 to 2 hours to complete, depending on your familiarity with Python and ML concepts. It is best suited to beginners.
Let’s start building the project from scratch. We will start by:
Without any further delay, let’s start!
To build the email classification model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:
Now that the .csv files have been downloaded, let’s upload them to the Colab environment. Use the following code to open a file picker and load the files.
from google.colab import files
uploaded = files.upload()
Once uploaded, import them into Pandas. Here’s the code to do so:
import pandas as pd
# Load each dataset
df1 = pd.read_csv('completeSpamAssassin.csv')
df2 = pd.read_csv('enronSpamSubset.csv')
df3 = pd.read_csv('lingSpam.csv')
# Check shape and preview each
print("SpamAssassin:", df1.shape)
print("Enron:", df2.shape)
print("LingSpam:", df3.shape)
# Optional: display a few rows
df1.head()
Output:
SpamAssassin: (6046, 3)
Enron: (10000, 4)
LingSpam: (2605, 3)
  | Unnamed: 0 | Body | Label
0 | 0 | \nSave up to 70% on Life Insurance.\nWhy Spend... | 1
1 | 1 | 1) Fight The Risk of Cancer!\nhttp://www.adcli... | 1
2 | 2 | 1) Fight The Risk of Cancer!\nhttp://www.adcli... | 1
3 | 3 | ##############################################... | 1
4 | 4 | I thought you might like these:\n1) Slim Down ... | 1
What does the output tell us?
The output shows that the datasets do not share a standard column layout (Enron has four columns, the others have three), so we will standardize them before merging. From each dataset, we will keep only the email content and label columns and rename them to match. Here is the code to do so:
# For SpamAssassin (df1)
df1 = df1[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})
# For Enron (df2) -- assume 'Body' is the email content
df2 = df2[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})
# For LingSpam (df3)
df3 = df3[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})
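The three renaming lines above follow the same pattern, so they could be folded into one helper that also fails loudly if a file lacks the expected columns. This is a minimal sketch rather than the article's code: the function name `standardize_columns` and the toy DataFrames are hypothetical, and it assumes, as the code above does, that each file has `Body` and `Label` columns.

```python
import pandas as pd

def standardize_columns(frame, text_col="Body", label_col="Label"):
    """Return a two-column copy with the columns renamed to 'text' and 'label'."""
    missing = [c for c in (text_col, label_col) if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected column(s): {missing}")
    return frame[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"}
    )

# Toy stand-ins for the real CSV files (hypothetical data, for illustration only)
toy_a = pd.DataFrame({"Unnamed: 0": [0], "Body": ["win a prize"], "Label": [1]})
toy_b = pd.DataFrame(
    {"Unnamed: 0": [0], "Extra": ["x"], "Body": ["see you at noon"], "Label": [0]}
)

combined = pd.concat(
    [standardize_columns(f) for f in (toy_a, toy_b)], ignore_index=True
)
print(combined)
```

With the real data, the same call would be applied to `df1`, `df2`, and `df3` before concatenation.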
Now let’s combine them into a single dataset, using the code given below:
# Combine all three datasets
df = pd.concat([df1, df2, df3], ignore_index=True)
# Preview the merged data
df.head()
Check the structure using the code given below:
df.info()
df['label'].value_counts()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18651 entries, 0 to 18650
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 18650 non-null object
1 label 18651 non-null int64
dtypes: int64(1), object(1)
memory usage: 291.6+ KB
label | count
0 | 11322
1 | 7329
dtype: int64
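What does the output mean?
The output describes the merged dataset: it contains 18,651 emails, the text column has one missing value (18,650 non-null entries), and the labels are moderately imbalanced, with 11,322 emails labeled 0 (ham) and 7,329 labeled 1 (spam).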
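Since the `df.info()` output reports one missing `text` value, and the SpamAssassin preview shows two rows that begin identically, it may be worth checking for missing and duplicate emails before training. This step is not part of the original walkthrough. The sketch below uses a hypothetical toy DataFrame in place of the merged `df`, and applying it to the real data would change the row counts and scores reported in this article.

```python
import pandas as pd

# Hypothetical stand-in for the merged DataFrame `df`
toy = pd.DataFrame({
    "text": ["cheap meds now", "cheap meds now", None, "meeting at 3pm"],
    "label": [1, 1, 0, 0],
})

print("Missing text rows:", toy["text"].isna().sum())
print("Duplicate text rows:", toy.duplicated(subset="text").sum())

# Drop rows with no text, then drop exact duplicate emails
cleaned = (
    toy.dropna(subset=["text"])
       .drop_duplicates(subset="text")
       .reset_index(drop=True)
)
print(cleaned)
```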
Before feeding the text to any machine learning model, we need to clean it up. Raw email content often includes unnecessary characters, URLs, numbers, etc. All this can reduce model performance.
To fix this, let’s define a preprocessing function and apply it to the text column. Use the code given below to accomplish the same:
import re
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Set of English stopwords
stop_words = set(stopwords.words('english'))
# Function to clean each email (safe version)
def preprocess_text(text):
    text = str(text)  # Convert to string in case it's float or NaN (NaN becomes 'nan')
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove digits
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]  # Drop stopwords
    return " ".join(filtered_words)
# Apply to DataFrame
df['clean_text'] = df['text'].apply(preprocess_text)
Now, let's quickly compare the raw and cleaned text side by side to confirm that the preprocessing has lowercased the text and removed URLs, HTML tags, punctuation, digits, and stopwords.
Use the following code to do this:
df[['text', 'clean_text']].head()
Output:
  | text | clean_text
0 | \nSave up to 70% on Life Insurance.\nWhy Spend... | save life insurance spend tolife quote savings...
1 | 1) Fight The Risk of Cancer!\nhttp://www.adcli... | fight risk cancer slim guaranteed lose lbs day...
2 | 1) Fight The Risk of Cancer!\nhttp://www.adcli... | fight risk cancer slim guaranteed lose lbs day...
3 | ##############################################... | adult club offers free membership instant acce...
4 | I thought you might like these:\n1) Slim Down ... | thought might like slim guaranteed lose lbs da...
From the output, we can see that the preprocessing worked as intended on these samples, apart from an occasional fused token such as tolife in row 0. So let’s move ahead.
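The preview contains a fused token, `tolife`, in row 0. One possible cause is that tags, punctuation, and digits are deleted outright, which can glue neighboring words together. The variant below is a sketch under that assumption and is not the article's function: it replaces removed characters with a space instead. The name `preprocess_with_spaces` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the article uses NLTK's full English list).

```python
import re
import string

# Small illustrative stopword set (an assumption; the article uses NLTK's English list)
STOP_WORDS = {"the", "a", "an", "to", "on", "of", "and", "is", "up"}

# Maps every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess_with_spaces(text):
    """Variant of preprocess_text that substitutes spaces for removed characters."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", " ", text)  # URLs
    text = re.sub(r"<.*?>", " ", text)           # HTML tags
    text = text.translate(PUNCT_TO_SPACE)        # punctuation
    text = re.sub(r"\d+", " ", text)             # digits
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "Save up to 70%<br>on Life-Insurance! Visit http://example.com"
print(preprocess_with_spaces(sample))  # prints: save life insurance visit
```

With deletion instead of substitution, `Life-Insurance` in this sample would collapse into the single token `lifeinsurance`.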
Machine learning models can’t understand text directly. They only work with numbers. Therefore, we will convert clean_text into numerical vectors in this step. We will be using TF-IDF Vectorizer. Here is the code:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the vectorizer
vectorizer = TfidfVectorizer(max_df=0.7)
# Fit and transform the clean_text column
X = vectorizer.fit_transform(df['clean_text'])
# Labels (spam or not)
y = df['label']
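To make the vectorization step more concrete, the toy example below applies the same `TfidfVectorizer(max_df=0.7)` setting to four made-up emails. The corpus and the variable names `toy_corpus`, `toy_vectorizer`, and `toy_X` are hypothetical. With `max_df=0.7`, terms that appear in more than 70% of documents are dropped from the vocabulary, which is why `email` (present in every toy document) disappears.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned emails; "email" appears in every document
toy_corpus = [
    "email free prize claim",
    "email meeting agenda monday",
    "email free offer",
    "email project report",
]

toy_vectorizer = TfidfVectorizer(max_df=0.7)
toy_X = toy_vectorizer.fit_transform(toy_corpus)

vocab = sorted(toy_vectorizer.vocabulary_)
print("Vocabulary:", vocab)
print("Matrix shape:", toy_X.shape)  # (number of emails, number of kept terms)
```

Each row of the resulting sparse matrix is one email, and each column holds the TF-IDF weight of one remaining term.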
Now that we have numerical data, let’s split it into training and test sets. Use the code given below to do so:
from sklearn.model_selection import train_test_split
# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
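Note that the code above fits the vectorizer on the full dataset before splitting, so the vocabulary and IDF weights are computed partly from test emails. One common alternative, sketched below and not the method used for the results in this article, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer only ever sees training text. The toy emails, the `spam_pipeline` name, and the `stratify` argument are additions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical cleaned emails and labels (1 = spam, 0 = ham)
texts = [
    "free prize claim now", "win cash free offer", "cheap meds free shipping",
    "claim free gift card", "free lottery winner claim", "urgent free cash prize",
    "meeting agenda monday", "project report attached", "lunch tomorrow noon",
    "quarterly budget review", "team meeting notes", "please review project draft",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio similar in both parts
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training text only
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.7)),
    ("clf", LogisticRegression(max_iter=1000)),
])
spam_pipeline.fit(text_train, label_train)

print("Test emails:", len(text_test))
print("Predictions:", spam_pipeline.predict(text_test))
```

A fitted pipeline like this can also classify a new raw string directly, for example `spam_pipeline.predict(["claim your free prize"])`.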
In this step, we will train and evaluate our machine learning models. We will test four popular classifiers in one go and compare their performance on this binary classification problem.
Here is the code:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Dictionary of models to evaluate
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()
}
# Store the test accuracy of each model
results = {}
# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\n {name}")
    print("Accuracy:", accuracy)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    results[name] = accuracy
Output:
Multinomial Naive Bayes
Accuracy: 0.9404985258643795
Confusion Matrix:
[[2253 42]
[ 180 1256]]
Classification Report:
precision recall f1-score support
0 0.93 0.98 0.95 2295
1 0.97 0.87 0.92 1436
accuracy 0.94 3731
macro avg 0.95 0.93 0.94 3731
weighted avg 0.94 0.94 0.94 3731
Logistic Regression
Accuracy: 0.9654248190833556
Confusion Matrix:
[[2223 72]
[ 57 1379]]
Classification Report:
precision recall f1-score support
0 0.97 0.97 0.97 2295
1 0.95 0.96 0.96 1436
accuracy 0.97 3731
macro avg 0.96 0.96 0.96 3731
weighted avg 0.97 0.97 0.97 3731
Decision Tree Classifier
Accuracy: 0.9179844545698205
Confusion Matrix:
[[2097 198]
[ 108 1328]]
Classification Report:
precision recall f1-score support
0 0.95 0.91 0.93 2295
1 0.87 0.92 0.90 1436
accuracy 0.92 3731
macro avg 0.91 0.92 0.91 3731
weighted avg 0.92 0.92 0.92 3731
Random Forest Classifier
Accuracy: 0.9664969177164299
Confusion Matrix:
[[2211 84]
[ 41 1395]]
Classification Report:
precision recall f1-score support
0 0.98 0.96 0.97 2295
1 0.94 0.97 0.96 1436
accuracy 0.97 3731
macro avg 0.96 0.97 0.96 3731
weighted avg 0.97 0.97 0.97 3731
Let’s compare side by side how each model performed. This comparison will also help you understand the above output.
Model | Accuracy | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) | Remarks
Multinomial Naive Bayes | 94.05% | 0.97 | 0.87 | 0.92 | Fast and lightweight, slightly lower recall
Logistic Regression | 96.54% | 0.95 | 0.96 | 0.96 | High accuracy and balance
Decision Tree Classifier | 91.80% | 0.87 | 0.92 | 0.90 | Prone to overfitting, lowest performance
Random Forest Classifier | 96.65% | 0.94 | 0.97 | 0.96 | Best overall performer with high reliability
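The precision, recall, and F1 values in the classification reports can be reproduced by hand from each confusion matrix. The short worked example below does this for Multinomial Naive Bayes, using the four counts printed in the output above. The variable names are illustrative, and the layout assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix for Multinomial Naive Bayes, copied from the output above
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
tn, fp = 2253, 42
fn, tp = 180, 1256

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of emails flagged as class 1, the share that truly are
recall = tp / (tp + fn)     # of true class 1 emails, the share that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9405 precision=0.97 recall=0.87 f1=0.92
```

The low recall comes from the 180 class 1 emails that the model labeled as class 0.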
In this project, we built an email classification system that uses machine learning to separate spam from non-spam (ham) emails. We trained and tested four models: Multinomial Naive Bayes, Logistic Regression, Decision Tree, and Random Forest.
Random Forest gave the best results on this test split, with 96.65% accuracy, 0.94 precision, and the highest recall of 0.97 for the spam class. Logistic Regression was close behind, with balanced scores and 96.54% accuracy. The gap amounts to 4 of the 3,731 test emails (3,606 vs. 3,602 classified correctly), and Logistic Regression produced fewer false positives (72 vs. 84).
Multinomial Naive Bayes was fast and simple, but had the lowest spam recall (0.87). The Decision Tree Classifier showed the weakest performance (91.80% accuracy), which is consistent with a single tree's tendency to overfit.
These results suggest that Random Forest and Logistic Regression are both strong choices for email spam detection on these datasets, with Random Forest slightly ahead.
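Because Random Forest and Logistic Regression differ by only four test emails in the output above (3,606 vs. 3,602 correct out of 3,731), a single train/test split may not be enough to rank them. Cross-validation is one common way to check how stable such a ranking is. The sketch below is not part of the original project: the toy emails are hypothetical, and `n_estimators=50`, `random_state=42`, and `cv=3` are arbitrary illustrative settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned emails and labels (1 = spam, 0 = ham)
texts = [
    "free prize claim now", "win cash free offer", "cheap meds free shipping",
    "claim free gift card", "free lottery winner claim", "urgent free cash prize",
    "meeting agenda monday", "project report attached", "lunch tomorrow noon",
    "quarterly budget review", "team meeting notes", "please review project draft",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, clf in candidates.items():
    # Vectorizer and classifier are refitted inside every fold
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    cv_scores[name] = cross_val_score(
        pipeline, texts, labels, cv=3, scoring="accuracy"
    )
    print(f"{name}: fold accuracies = {cv_scores[name]}")
```

On the real dataset, comparing the spread of fold scores alongside their mean gives a better sense of whether one model is consistently ahead.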
Colab Link:
https://colab.research.google.com/drive/17qNg0jh-jlozEEkDobdeR_iCO5UIzuuJ?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...