Email Classification Using Machine Learning and NLP Techniques

By Rohit Sharma

Updated on Jul 30, 2025

Our inboxes are overflowing with emails every day. Some are promotional, some are important, and many are just plain spam. In addition to being time-consuming, manually sorting through them is dangerous because spam emails frequently contain malware or phishing links. This is where machine learning-based email classification is useful. It automates the process of labeling emails as spam or not spam.

We will develop a machine learning model in this project that can reliably categorize emails according to their content. To make sure our model is trained on a variety of email types, we will use three real-world datasets: SpamAssassin, Enron Spam Subset, and LingSpam. 

You will have a better understanding of the inner workings of spam filters by the end of this project. Additionally, you will get practical experience with Python classification models, text vectorization, and natural language processing (NLP).

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog. 

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming basics
  • Core machine learning concepts, such as classification and train-test splitting
  • Basic text processing and NLP ideas, such as stopwords and vectorization

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

  • Python: Core programming language for this project
  • Pandas: Data loading, cleaning, and manipulation
  • NumPy: Numerical operations and array handling
  • Scikit-learn (sklearn): Building ML models, data splitting, and evaluation metrics
  • CountVectorizer / TfidfVectorizer: Converting email text into numerical features (vectorization)
  • Matplotlib / Seaborn: (Optional) Visualizing model performance using plots and charts

Machine Learning Models That Will Be Used

To build an email classification project, we will use the following machine learning models. 

  • Multinomial Naive Bayes: Well-suited for text classification tasks such as spam detection. Fast and effective.
  • Logistic Regression: A strong baseline model for binary classification with interpretable results.
  • Decision Tree Classifier: Captures decision rules from the data. Easy to comprehend and visualize.
  • Random Forest Classifier: An ensemble method that improves accuracy by combining multiple decision trees.

Time Taken and Difficulty Level

On average, this project takes about 1 to 2 hours to complete. The duration may vary depending on your familiarity with Python and ML concepts. It is best suited for beginners.

How to Build the Email Classification Model

Let’s start building the project from scratch. We will start by:

  • Loading and combining the three email datasets
  • Preprocessing the data (text) for machine learning
  • Converting emails into numeric features using vectorization
  • Training multiple classification models
  • Evaluating and comparing their performance

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the email classification model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/code/rohitshirudkar/email-classification-spam-or-ham/input.
  3. On the Email Classification | Spam or Ham page, in the right pane, under the Input section, click completeSpamAssassin.csv.
  4. Click the download icon.
  5. Click enronSpamSubset.csv.
  6. Click the download icon.
  7. Click lingSpam.csv.
  8. Click the download icon.
  9. Unzip all the files.

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv files have been downloaded, let’s upload them to the Colab environment. Use the following code to open a file picker and load the files. 

from google.colab import files

uploaded = files.upload()
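If your CSV files are already stored in Google Drive, mounting the drive is an alternative to the upload dialog. Here is a minimal optional sketch; the folder path in the comment is only a placeholder, so adjust it to wherever you keep the files:

from google.colab import drive

# Mount Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Then read the files from your Drive folder (placeholder path shown below)
# df1 = pd.read_csv('/content/drive/MyDrive/email-data/completeSpamAssassin.csv')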

Once uploaded, import them into Pandas. Here’s the code to do so:

import pandas as pd

# Load each dataset
df1 = pd.read_csv('completeSpamAssassin.csv')
df2 = pd.read_csv('enronSpamSubset.csv')
df3 = pd.read_csv('lingSpam.csv')

# Check shape and preview each
print("SpamAssassin:", df1.shape)
print("Enron:", df2.shape)
print("LingSpam:", df3.shape)

# Optional: display a few rows
df1.head()

Output:

SpamAssassin: (6046, 3)
Enron: (10000, 4)
LingSpam: (2605, 3)

   Unnamed: 0                                               Body  Label
0           0  \nSave up to 70% on Life Insurance.\nWhy Spend...      1
1           1  1) Fight The Risk of Cancer!\nhttp://www.adcli...      1
2           2  1) Fight The Risk of Cancer!\nhttp://www.adcli...      1
3           3  ##############################################...      1
4           4  I thought you might like these:\n1) Slim Down ...      1


What does the output tell us?

  • completeSpamAssassin.csv has 6,046 rows and 3 columns
  • enronSpamSubset.csv has 10,000 rows and 4 columns
  • lingSpam.csv has 2,605 rows and 3 columns
  • Column names are not consistent across files.

As the output shows, the column names are not standardized, so we will standardize them before merging. We will extract just the email content and label from each dataset and rename them to match. Here is the code to do so:

# For SpamAssassin (df1)
df1 = df1[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

# For Enron (df2) -- assume 'Body' is the email content
df2 = df2[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

# For LingSpam (df3)
df3 = df3[['Body', 'Label']].rename(columns={'Body': 'text', 'Label': 'label'})

Now let’s combine them into a single dataset, using the code given below:

# Combine all three datasets
df = pd.concat([df1, df2, df3], ignore_index=True)

# Preview the merged data
df.head()
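Note that train_test_split will shuffle the data later, so the stacked order is not a problem. Still, if you prefer to shuffle the combined DataFrame right away, a minimal optional sketch is shown below; the previews later in this article were produced without shuffling, so their row order will differ if you run it:

# Optional: shuffle the combined rows (train_test_split shuffles by default anyway)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)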

Check the structure using the code given below:

df.info()
df['label'].value_counts()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18651 entries, 0 to 18650
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    18650 non-null  object
 1   label   18651 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 291.6+ KB

label
0    11322
1     7329
Name: count, dtype: int64

What does the output mean?

The output shows the dataset after merging. We get to know that:

  • There are 18,651 emails in total.
  • 11,322 are ham (label 0).
  • 7,329 are spam (label 1).
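The info() output also shows that the text column has one missing value (18,650 non-null entries out of 18,651 rows). The preprocessing function in the next step handles this safely by converting the value to a string, but if you would rather drop that row, a minimal optional sketch is:

# Optional: drop the single row whose email body is missing
# (the counts reported later in this article were produced without dropping it)
df = df.dropna(subset=['text']).reset_index(drop=True)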

Step 3: Preprocess the Data

Before feeding the text to any machine learning model, we need to clean it up. Raw email content often includes unnecessary characters, URLs, numbers, and other noise, all of which can reduce model performance.

To fix this, let’s define a preprocessing function and apply it to the text column. Use the code given below to accomplish the same:

import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Set of English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean each email (safe version)
def preprocess_text(text):
    text = str(text)  # Convert to string in case it's float or NaN
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'<.*?>', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

# Apply to DataFrame
df['clean_text'] = df['text'].apply(preprocess_text)
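To sanity-check the function, you can run it on a made-up example string; the text below is purely illustrative:

# Quick check of the cleaning function on an illustrative string
sample = "WIN a FREE prize now!!! Visit http://example.com or call 555-0100"
print(preprocess_text(sample))
# Should print something like: win free prize visit call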

Now, let's quickly compare raw vs cleaned text side by side. We will ensure that the preprocessing has removed:

  • Newlines (\n)
  • URLs
  • Numbers
  • Punctuation
  • Stopwords
  • Uppercase letters

Use the code to do this:

df[['text', 'clean_text']].head()

Output:

                                                text                                         clean_text
0  \nSave up to 70% on Life Insurance.\nWhy Spend...  save life insurance spend tolife quote savings...
1  1) Fight The Risk of Cancer!\nhttp://www.adcli...  fight risk cancer slim guaranteed lose lbs day...
2  1) Fight The Risk of Cancer!\nhttp://www.adcli...  fight risk cancer slim guaranteed lose lbs day...
3  ##############################################...  adult club offers free membership instant acce...
4  I thought you might like these:\n1) Slim Down ...  thought might like slim guaranteed lose lbs da...


From the output, we can see that the preprocessing worked as expected, so let's move ahead.

Step 4: Vectorize the Cleaned Text 

Machine learning models can't understand text directly; they only work with numbers. In this step, we will therefore convert clean_text into numerical vectors using scikit-learn's TF-IDF vectorizer. Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer
vectorizer = TfidfVectorizer(max_df=0.7)

# Fit and transform the clean_text column
X = vectorizer.fit_transform(df['clean_text'])

# Labels (spam or not)
y = df['label']
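As a quick optional check, you can inspect the size of the resulting feature matrix and a few of the learned vocabulary terms (get_feature_names_out is available on recent scikit-learn versions):

# Inspect the TF-IDF feature matrix and a few vocabulary entries
print("Feature matrix shape:", X.shape)
print("Sample features:", vectorizer.get_feature_names_out()[:10])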

Step 5: Train-Test Split

Now that we have numerical data, let's split it into training and test sets. Use the code given below to do so:

from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
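If you want the spam-to-ham ratio preserved exactly in both splits, scikit-learn's stratify parameter can do that. A minimal optional variant is shown below; note that the class counts in the outputs later in this article come from the unstratified split above:

# Optional: stratified split keeps the class ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)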

Step 6: Train and Evaluate Models

In this step, we will train and evaluate our machine learning models. We will test four popular classifiers in one go and compare their performance on this binary classification problem.

Here is the code:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Dictionary of models to evaluate
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()
}

# Store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(f"\n {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    
    results[name] = accuracy_score(y_test, y_pred)

Output:

 Multinomial Naive Bayes
Accuracy: 0.9404985258643795
Confusion Matrix:
 [[2253   42]
 [ 180 1256]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95      2295
           1       0.97      0.87      0.92      1436

    accuracy                           0.94      3731
   macro avg       0.95      0.93      0.94      3731
weighted avg       0.94      0.94      0.94      3731


 Logistic Regression
Accuracy: 0.9654248190833556
Confusion Matrix:
 [[2223   72]
 [  57 1379]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97      2295
           1       0.95      0.96      0.96      1436

    accuracy                           0.97      3731
   macro avg       0.96      0.96      0.96      3731
weighted avg       0.97      0.97      0.97      3731


 Decision Tree Classifier
Accuracy: 0.9179844545698205
Confusion Matrix:
 [[2097  198]
 [ 108 1328]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.91      0.93      2295
           1       0.87      0.92      0.90      1436

    accuracy                           0.92      3731
   macro avg       0.91      0.92      0.91      3731
weighted avg       0.92      0.92      0.92      3731


 Random Forest Classifier
Accuracy: 0.9664969177164299
Confusion Matrix:
 [[2211   84]
 [  41 1395]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      2295
           1       0.94      0.97      0.96      1436

    accuracy                           0.97      3731
   macro avg       0.96      0.97      0.96      3731
weighted avg       0.97      0.97      0.97      3731

Step 7: Final Model Comparison

Let’s compare side by side how each model performed. This comparison will also help you understand the above output. 

  • Multinomial Naive Bayes: 94.04% accuracy; class 1 precision 0.97, recall 0.87, F1-score 0.92. Fast and lightweight, slightly lower recall.
  • Logistic Regression: 96.54% accuracy; class 1 precision 0.95, recall 0.96, F1-score 0.96. High accuracy and balance.
  • Decision Tree Classifier: 91.79% accuracy; class 1 precision 0.87, recall 0.92, F1-score 0.90. Prone to overfitting, lowest performance.
  • Random Forest Classifier: 96.65% accuracy; class 1 precision 0.94, recall 0.97, F1-score 0.96. Best overall performer with high reliability.
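Since Matplotlib was listed earlier as an optional library for visualizing model performance, here is a minimal sketch that turns the results dictionary from Step 6 into a bar chart of accuracies:

import matplotlib.pyplot as plt

# Plot the accuracy of each model stored in the results dictionary
names = list(results.keys())
accuracies = [results[name] for name in names]

plt.figure(figsize=(8, 4))
plt.bar(names, accuracies)
plt.ylabel("Accuracy")
plt.title("Model Accuracy Comparison")
plt.xticks(rotation=20, ha="right")
plt.ylim(0.85, 1.0)
plt.tight_layout()
plt.show()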

Conclusion

In this project, we built an email classification system to detect spam and non-spam (ham) emails using machine learning. We trained and tested four models: Multinomial Naive Bayes, Logistic Regression, Decision Tree, and Random Forest.

Random Forest Classifier gave the best results. It achieved 96.65% accuracy, 0.94 precision, and the highest recall of 0.97. Logistic Regression also performed well, with balanced scores and 96.54% accuracy.
Multinomial Naive Bayes was fast and simple, but had slightly lower recall. Decision Tree Classifier showed the weakest performance and was prone to overfitting.

These results show that Random Forest is the most reliable choice for email spam detection. 
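To see the whole pipeline end to end, you can classify a new message with the trained Random Forest and the fitted vectorizer; the email text below is made up purely for illustration:

# Classify a new, made-up email using the trained Random Forest
best_model = models["Random Forest Classifier"]
new_email = "Congratulations! You have won a free gift card. Click the link to claim it now."
new_vec = vectorizer.transform([preprocess_text(new_email)])
print("Spam" if best_model.predict(new_vec)[0] == 1 else "Ham")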


Colab Link:
https://colab.research.google.com/drive/17qNg0jh-jlozEEkDobdeR_iCO5UIzuuJ?usp=sharing


