Fake News Detection Project Using Python and ML

By Rohan Vats

Updated on Jul 18, 2025 | 12 min read | 23.21K+ views


False information spreads quickly through websites and social media, which makes fake news detection an important problem: fabricated stories can sow panic and even influence real-world decision-making. In this project, we will build an ML model that automatically classifies a news article as real or fake, based solely on its content.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog. 

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming basics
  • Working with data using Pandas
  • Fundamental machine learning concepts, such as classification and train/test splits

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library         | Purpose
---------------------|---------------------------------------------
Python               | Programming language
Google Colab         | Environment to write and run the code
Pandas               | Data manipulation
NumPy                | Array operations
Matplotlib/Seaborn   | Plotting and data visualization
Scikit-learn         | Machine learning models and evaluation
NLTK                 | Text preprocessing (e.g., stopword removal)

Models Used for Learning

We will use two models that are lightweight yet very effective for binary text classification:

  • Passive Aggressive Classifier: Suitable for large-scale learning and online data streams. It updates its weights aggressively when it misclassifies an example, but unlike most learners it stays passive (makes no update) when it classifies correctly. A minimal sketch of this update rule follows the list.
  • Multinomial Naive Bayes: A probabilistic model that assumes words occur independently of one another. It is fast, efficient, and quite popular for spam and fake news detection.
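To make the passive/aggressive behavior concrete, here is a minimal sketch of the classic PA-I update rule. This only illustrates the idea; it is not scikit-learn's internal implementation, and the function and example values are invented for illustration:

import numpy as np

def pa_update(w, x, y, C=1.0):
    # One PA-I update step; in this formulation the label y must be +1 or -1
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss: 0 when margin >= 1
    if loss == 0.0:
        return w  # passive: correct with enough margin, no update
    tau = min(C, loss / np.dot(x, x))  # aggressive: step size grows with the loss
    return w + tau * y * x

# Example: one misclassified point triggers an update
w = np.zeros(3)
w = pa_update(w, x=np.array([1.0, 0.5, 0.0]), y=1)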

 


Time Taken and Difficulty

You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to NLP and text classification.

 

How to Build a Fake News Detection Model

Let’s build the project from scratch. We will proceed by:

  1. Loading the dataset
  2. Cleaning the text
  3. Training our model

Without any further delay, let’s start!

Step 1: Download the Dataset

To train our fake news detection model, we will use the fake and real news dataset available on Kaggle. It contains two .csv files: one with real news articles (True.csv) and another with fake news articles (Fake.csv).

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/jainpooja/fake-news-detection
  3. On the Fake News Detection page, in the right pane, under the Data Explorer section, click Fake.csv.
  4. Click the download icon.
  5. Once downloaded, click the True.csv file.
  6. Click the download icon.
  7. Navigate to your downloads folder and extract the files.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded both files, upload them to Google Colab using the code below:

from google.colab import files

# Opens a file picker; select both Fake.csv and True.csv
uploaded = files.upload()

Note: Depending on your connection, it can take 10-15 minutes for both files to upload.

Once both files (Fake.csv and True.csv) have been uploaded, load the data using the code below:

import pandas as pd

# Load the fake and real news datasets
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")

# Preview a few rows
df_fake.head()

Output

 

   title                                              text                                                subject  date
0  Donald Trump Sends Out Embarrassing New Year’...  Donald Trump just couldn t wish all Americans ...  News     December 31, 2017
1  Drunk Bragging Trump Staffer Started Russian ...  House Intelligence Committee Chairman Devin Nu...  News     December 31, 2017
2  Sheriff David Clarke Becomes An Internet Joke...  On Friday, it was revealed that former Milwauk...  News     December 30, 2017
3  Trump Is So Obsessed He Even Has Obama’s Name...  On Christmas day, Donald Trump announced that ...  News     December 29, 2017
4  Pope Francis Just Called Out Donald Trump Dur...  Pope Francis used his annual Christmas Day mes...  News     December 25, 2017

 

Step 3: Add Labels and Combine the Datasets

Now that both datasets are loaded, it is time to prepare them for training. We will add a new column called label to each dataset:

  • Assign 0 to all fake news articles (from Fake.csv)
  • Assign 1 to all real news articles (from True.csv)

After that, we combine these two datasets into a single DataFrame. Here is the code that accomplishes this:

# Add labels to each dataset
df_fake['label'] = 0  # 0 = Fake news
df_true['label'] = 1  # 1 = Real news

# Combine the two datasets
df = pd.concat([df_fake, df_true], axis=0)
df = df.reset_index(drop=True)

# Display the first few rows
df.head()

Output

 

   title                                              text                                                subject  date               label
0  Donald Trump Sends Out Embarrassing New Year’...  Donald Trump just couldn t wish all Americans ...  News     December 31, 2017  0
1  Drunk Bragging Trump Staffer Started Russian ...  House Intelligence Committee Chairman Devin Nu...  News     December 31, 2017  0
2  Sheriff David Clarke Becomes An Internet Joke...  On Friday, it was revealed that former Milwauk...  News     December 30, 2017  0
3  Trump Is So Obsessed He Even Has Obama’s Name...  On Christmas day, Donald Trump announced that ...  News     December 29, 2017  0
4  Pope Francis Just Called Out Donald Trump Dur...  Pope Francis used his annual Christmas Day mes...  News     December 25, 2017  0
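Optionally, you can sanity-check the class balance of the combined DataFrame before moving on:

# Optional: check how many fake (0) and real (1) articles we have
print(df['label'].value_counts())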

Step 4: Clean the Text Data

Before training our model, we need to clean the news content. Raw text often contains noise, such as punctuation, links, numbers, and stopwords, that doesn’t help with prediction.

We will create a function to remove:

  • Punctuation and special characters
  • Numbers and extra spaces
  • Common words like the, and, is (called stopwords)

Doing so will help the model focus only on meaningful words.

Here is the code:

import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load English stopwords (a set makes membership checks fast)
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>+', '', text)  # Remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # Remove punctuation
    text = re.sub(r'\n', ' ', text)  # Remove newline characters
    text = re.sub(r'\w*\d\w*', '', text)  # Remove words with numbers
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text
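You can quickly sanity-check the function on a made-up sentence:

# Quick check on a sample string (example text is invented)
sample = "BREAKING: Visit https://example.com now!!! The story has 3 updates."
print(clean_text(sample))
# Expected: "breaking visit story updates"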

We will clean only the text column, since that’s what we will use to train our model. Here is the code to do so:

df['text'] = df['text'].apply(clean_text)

Step 5: Drop Unused Columns and Prepare the Data for Training

In our dataset, we have extra columns that are not going to be used for any prediction: the title, subject, and date. As we will be working only with the cleaned text and the label, we can safely discard them.

The following is the code to do so:

# Drop the columns we don't need
df = df.drop(['title', 'subject', 'date'], axis=1)

# Check the updated DataFrame
df.head()

Output

  text label

0

donald trump wish americans happy new year lea... 0

1

house intelligence committee chairman devin nu... 0

2

friday revealed former milwaukee sheriff david... 0

3

christmas day donald trump announced would bac... 0

4

pope francis used annual christmas day message... 0

Thus, we have the clean text and the labels. The next step is to convert text to numbers in a way that can be understood by the machine learning model.

Step 6: Convert Text to Vectors (TF-IDF)

Machine learning algorithms cannot operate on words directly; they operate on numbers. To convert (vectorize) our cleaned news text into a numerical format, we will use TF-IDF (Term Frequency-Inverse Document Frequency). It gives higher importance to words that are frequent in a document but rare across others.
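To see this weighting in action before applying it to our data, here is a tiny standalone illustration (the two example sentences are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Two toy documents: "rigged" appears in only one, "election" in both
docs = ["the election results are out", "the election was rigged"]
demo_vec = TfidfVectorizer()
weights = demo_vec.fit_transform(docs)

# "rigged" (unique to one document) gets a higher weight than "election" (in both)
print(pd.DataFrame(weights.toarray(), columns=demo_vec.get_feature_names_out()).round(2))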

Here is the code to do so:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Define input (text) and target (label)
X = df['text']
y = df['label']

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the text into TF-IDF features
X_vectorized = vectorizer.fit_transform(X)

# Split the data into training and testing sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.25, random_state=42)

Step 7: Train the Model

Now that we have transformed the text into numbers, it’s time to train our machine learning models. We will start with two simple and effective classifiers:

  • Passive Aggressive Classifier
  • Multinomial Naive Bayes

We will fit both models to the training data and then test how well they perform on the test set.

Model 1: Passive Aggressive Classifier

This model updates itself only when it gets a prediction wrong. It’s fast and works well with large datasets.

Here is the code:

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Initialize the model
pac = PassiveAggressiveClassifier()

# Train the model
pac.fit(X_train, y_train)

# Predict on the test set
y_pred_pac = pac.predict(X_test)

# Evaluate
print("PAC Accuracy:", accuracy_score(y_test, y_pred_pac))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_pac))

Output

PAC Accuracy: 0.9958129175946547
Confusion Matrix:
[[5871   24]
 [  23 5307]]
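Because the Passive Aggressive Classifier supports online learning, it can also be trained incrementally as new articles arrive. Here is a minimal optional sketch, assuming the X_train and y_train produced in Step 6:

# Optional: incremental training in mini-batches with partial_fit
pac_online = PassiveAggressiveClassifier()
batch_size = 1000
for start in range(0, X_train.shape[0], batch_size):
    X_batch = X_train[start:start + batch_size]
    y_batch = y_train.iloc[start:start + batch_size]
    pac_online.partial_fit(X_batch, y_batch, classes=[0, 1])  # classes required on the first call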

Model 2: Multinomial Naive Bayes

This is a simple and effective model for text classification. It works well when features (words) are treated independently.

Here is the code:

from sklearn.naive_bayes import MultinomialNB

# Initialize the model
nb = MultinomialNB()

# Train the model
nb.fit(X_train, y_train)

# Predict on the test set
y_pred_nb = nb.predict(X_test)

# Evaluate
print("NB Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))

Output

NB Accuracy: 0.9488641425389756
Confusion Matrix:
[[5529  366]
 [ 208 5122]]

Step 8: Generate Classification Reports

Accuracy alone doesn’t tell the full story. To better understand how each model performed, we will generate classification reports showing precision, recall, and F1-score.

These metrics help us check:

  • How many fake/real articles were correctly predicted
  • How often the model makes false predictions

Report for Passive Aggressive Classifier

Let’s check the detailed performance of the Passive Aggressive model.

from sklearn.metrics import classification_report

# Classification report for PAC
print("Classification Report - Passive Aggressive Classifier:\n")
print(classification_report(y_test, y_pred_pac))

Output

Classification Report - Passive Aggressive Classifier:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5895
           1       1.00      1.00      1.00      5330

    accuracy                           1.00     11225
   macro avg       1.00      1.00      1.00     11225
weighted avg       1.00      1.00      1.00     11225

Report for Multinomial Naive Bayes

Now we will generate the report for the Naive Bayes model.

# Classification report for Naive Bayes
print("Classification Report - Multinomial Naive Bayes:\n")
print(classification_report(y_test, y_pred_nb))

Output

Classification Report - Multinomial Naive Bayes:

              precision    recall  f1-score   support

           0       0.96      0.94      0.95      5895
           1       0.93      0.96      0.95      5330

    accuracy                           0.95     11225
   macro avg       0.95      0.95      0.95     11225
weighted avg       0.95      0.95      0.95     11225

What do these reports tell us?

These reports tell us (a quick worked check follows the list):

  • Precision: Out of the articles predicted as real/fake, how many were correct
  • Recall: Out of all actual real/fake articles, how many the model found
  • F1-score: The balance (harmonic mean) of precision and recall
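As a quick worked check, you can recompute the Naive Bayes metrics for the real class (label 1) directly from its confusion matrix in Step 7:

# Values from the Naive Bayes confusion matrix above:
# [[5529  366]
#  [ 208 5122]]
tn, fp, fn, tp = 5529, 366, 208, 5122

precision_real = tp / (tp + fp)  # 5122 / 5488 ≈ 0.93
recall_real = tp / (tp + fn)     # 5122 / 5330 ≈ 0.96
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)  # ≈ 0.95

print(round(precision_real, 2), round(recall_real, 2), round(f1_real, 2))

These values match the 0.93 / 0.96 / 0.95 row in the report above.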

Step 9: Final Comparison and Conclusion

Now that we have seen the performance of both models, let’s quickly compare their results and summarize what we learned from this project.

Metric            | Passive Aggressive    | Naive Bayes
------------------|-----------------------|---------------------------
Accuracy          | ~99%                  | ~95%
Speed             | Fast                  | Very fast
Best for          | Real-time updates     | Clean, balanced text data
Precision/Recall  | High (both classes)   | Slightly lower than PAC

Conclusion

Both classifiers performed very well on this dataset. For accuracy, the Passive Aggressive Classifier is preferable. The Naive Bayes model, by comparison, is simpler and even faster, and also performs well; its real advantage is quick prototyping.

Throughout this project, you have learned to apply machine learning to fight misinformation by analyzing news content. You have also learned how to clean text, turn text into features, and evaluate model performance.
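As a final hands-on touch, here is a minimal sketch of how you could classify a brand-new article with the trained pipeline (it reuses clean_text, vectorizer, and pac from the steps above; the sample headline is invented):

# Classify a new, unseen article
sample = ["The government announced a new renewable energy policy on Monday."]
sample_clean = [clean_text(t) for t in sample]
sample_vec = vectorizer.transform(sample_clean)  # transform only; never re-fit on new data
print("Real" if pac.predict(sample_vec)[0] == 1 else "Fake")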


Reference:
https://www.kaggle.com/datasets/jainpooja/fake-news-detection

Colab Link:
https://colab.research.google.com/drive/1WuJPnzQFm2kQ5W3y6r1MrWIvhFHA3__U?usp=sharing
