Fake News Detection Project Using Python and ML
By Rohan Vats
Updated on Jul 18, 2025 | 12 min read | 23.21K+ views
False information spreads quickly through websites and social media, where it can sow panic and even influence real-world decisions, so detecting fake news is an important problem. In this project, we will build a machine learning model that automatically classifies a news article as real or fake, based solely on its content.
It helps to have at least some background in:
For this project, the following tools and libraries will be used:
Tool/Library | Purpose
Python | Programming language
Colab | To write and run the code
Pandas | Data manipulation
NumPy | Array operations
Matplotlib/Seaborn | Plotting and data visualization
Scikit-learn | Machine learning models and evaluation
NLTK | Text preprocessing (such as stopword removal)
Two models that are lightweight yet very effective in solving binary text classification problems will be used.
You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to NLP and text classification.
Let’s start building the project from scratch. We will start by:
Without any further delay, let’s start!
To train our fake news detection model, we will use the fake and real news dataset available on Kaggle. It contains two .csv files: one with real news articles (True.csv) and another with fake news articles (Fake.csv).
Follow the steps mentioned below to download the dataset:
Now that you have downloaded both files, upload them to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
Note: It can take 10-15 minutes to upload both files.
Once both files (Fake.csv and True.csv) have been uploaded, load the data using the code below:
import pandas as pd
# Load the fake and real news datasets
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")
# Preview a few rows
df_fake.head()
Output
 | title | text | subject | date
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017
After loading both datasets, it is time to prepare them for training. We will add a new column called label to each dataset: 0 for fake news and 1 for real news.
After that, we combine these two datasets into a single DataFrame. Here is the code that accomplishes this:
# Add labels to each dataset
df_fake['label'] = 0 # 0 = Fake news
df_true['label'] = 1 # 1 = Real news
# Combine the two datasets
df = pd.concat([df_fake, df_true], axis=0)
df = df.reset_index(drop=True)
# Display the first few rows
df.head()
Output
 | title | text | subject | date | label
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0
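Because the combined DataFrame lists every fake article before any real one, df.head() only ever shows label 0. A quick sanity check on the label counts and on empty texts can confirm that both classes made it in. The sketch below uses two tiny made-up DataFrames (toy_fake and toy_true are hypothetical stand-ins, not rows from the Kaggle files); on the real data you would run the same two checks on df.

```python
import pandas as pd

# Hypothetical stand-ins for Fake.csv and True.csv (made-up rows, not real data)
toy_fake = pd.DataFrame({"text": ["aliens run the senate", "miracle cure hidden"]})
toy_true = pd.DataFrame({"text": ["senate passes budget", "bank holds rates", "court hears appeal"]})

# Same labelling and concatenation steps as in the article
toy_fake["label"] = 0  # 0 = Fake news
toy_true["label"] = 1  # 1 = Real news
toy_df = pd.concat([toy_fake, toy_true], axis=0).reset_index(drop=True)

# How many rows of each class ended up in the combined frame?
label_counts = toy_df["label"].value_counts().sort_index()

# How many rows have an empty (whitespace-only) text?
empty_texts = int((toy_df["text"].str.strip() == "").sum())

print(label_counts.to_dict())
print("Empty texts:", empty_texts)
```

If one class turned out to be far larger than the other, accuracy alone would be a less trustworthy metric, which is one reason to look at the counts before training.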
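Before training our model, we need to clean the news content. Raw text often contains noise such as punctuation, links, numbers, and stopwords that doesn’t help with prediction.
We will create a function that lowercases the text and removes text in square brackets, URLs, HTML tags, punctuation, newline characters, words containing digits, and English stopwords.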
Doing so will help the model focus only on meaningful words.
Here is the code:
import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Load English stopwords (a set makes the membership test below fast)
stop_words = set(stopwords.words('english'))
# Function to clean text
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text) # Remove text in brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # Remove URLs
    text = re.sub(r'<.*?>+', '', text) # Remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text) # Remove punctuation
    text = re.sub(r'\n', ' ', text) # Replace newline characters with spaces
    text = re.sub(r'\w*\d\w*', '', text) # Remove words with numbers
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text
We will clean only the text column, since that’s what we will use to train our model. Here is the code to do so:
df['text'] = df['text'].apply(clean_text)
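To see what each cleaning step does, it can help to run the same sequence of substitutions on a single made-up headline. The sketch below repeats the steps of clean_text but, so that it runs without downloading anything, swaps NLTK's stopword list for a tiny hand-written set (DEMO_STOPWORDS is an assumption for illustration only; the full NLTK list removes many more words). The sample sentence and URL are invented.

```python
import re
import string

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"the", "a", "is", "in", "at", "on", "and", "to", "of"}

def clean_text_demo(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r'\[.*?\]', '', text)                               # drop [bracketed] text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # drop URLs
    text = re.sub(r'<.*?>+', '', text)                                # drop HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\n', ' ', text)                                   # newlines -> spaces
    text = re.sub(r'\w*\d\w*', '', text)                              # drop words with digits
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "BREAKING [VIDEO]: The Senate voted 52-48 on Tuesday!\nRead more at https://example.com/story <b>now</b>"
cleaned = clean_text_demo(sample)
print(cleaned)  # breaking senate voted tuesday read more now
```

Note that the vote count "52-48" disappears entirely: punctuation removal turns it into "5248", which the digit rule then deletes. That is fine for this project, but it is worth knowing that numeric details never reach the model.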
In our dataset, we have extra columns that are not going to be used for any prediction: the title, subject, and date. As we will be working only with the cleaned text and the label, we can safely discard them.
The following is the code to do so:
# Drop the columns we don't need
df = df.drop(['title', 'subject', 'date'], axis=1)
# Check the updated DataFrame
df.head()
Output
 | text | label
0 | donald trump wish americans happy new year lea... | 0
1 | house intelligence committee chairman devin nu... | 0
2 | friday revealed former milwaukee sheriff david... | 0
3 | christmas day donald trump announced would bac... | 0
4 | pope francis used annual christmas day message... | 0
Thus, we have the clean text and the labels. The next step is to convert text to numbers in a way that can be understood by the machine learning model.
Machine learning algorithms operate on numbers, not words. To vectorize our cleaned news text into a numerical format, we will use TF-IDF (Term Frequency–Inverse Document Frequency). It gives higher weight to words that are frequent in a document but rare across the other documents.
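As a small illustration of that weighting, the sketch below runs scikit-learn's TfidfVectorizer with default settings on three invented sentences (toy_docs is made up for this example). The word "president" appears in every sentence, while "alien" appears in only one, so within that sentence "alien" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; "president" occurs in all of them
toy_docs = [
    "president signs trade bill",
    "president shocked by alien invasion",
    "president visits trade summit",
]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)

# Look up the column of each word and read its weight in the second document
vocab = toy_vec.vocabulary_
second_doc = toy_matrix[1].toarray().ravel()
w_alien = second_doc[vocab["alien"]]
w_president = second_doc[vocab["president"]]

print("alien:", round(w_alien, 3), "president:", round(w_president, 3))
print("rare word weighted higher:", w_alien > w_president)
```

Both words occur once in the second sentence, so the difference in weight comes entirely from the inverse document frequency part.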
Here is the code to vectorize the text and split the data:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Define input (text) and target (label)
X = df['text']
y = df['label']
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
# Transform the text into TF-IDF features
X_vectorized = vectorizer.fit_transform(X)
# Split the data into training and testing sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.25, random_state=42)
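One thing to be aware of: the code above fits the vectorizer on every article before splitting, so the vocabulary and document frequencies also reflect the test articles. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows that order on a handful of invented sentences (toy_texts and toy_labels are hypothetical); how much this changes the scores on the actual dataset is not something the article measures.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up cleaned texts: 0 = fake, 1 = real
toy_texts = [
    "aliens secretly control senate",
    "miracle cure hidden doctors",
    "celebrity clone spotted mall",
    "moon landing staged studio",
    "senate passes annual budget",
    "central bank holds interest rates",
    "court hears appeal tax case",
    "city council approves transit plan",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# 1) Split the raw text first
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# 2) Learn vocabulary and IDF values from the training texts only
toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(txt_train)

# 3) Reuse the fitted vectorizer on the test texts (transform, not fit_transform)
X_toy_test = toy_vectorizer.transform(txt_test)

print(X_toy_train.shape, X_toy_test.shape)
```

Words that appear only in the test texts are simply ignored by transform, which mirrors what happens when the model later sees a genuinely new article.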
Now that we have transformed the text into numbers, it’s time to train our machine learning models. We will start with two simple and effective classifiers: the Passive Aggressive Classifier and Multinomial Naive Bayes.
We will fit both models to the training data and then test how well they perform on the test set.
Model 1: Passive Aggressive Classifier
This model stays passive when an example is already classified correctly with a sufficient margin, and updates its weights aggressively when it is not. It’s fast and works well with large datasets.
Here is the code:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Initialize the model
pac = PassiveAggressiveClassifier()
# Train the model
pac.fit(X_train, y_train)
# Predict on the test set
y_pred_pac = pac.predict(X_test)
# Evaluate
print("PAC Accuracy:", accuracy_score(y_test, y_pred_pac))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_pac))
Output
PAC Accuracy: 0.9958129175946547
Confusion Matrix:
[[5871 24]
[ 23 5307]]
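The "passive" and "aggressive" behaviour described for this model can be sketched in a few lines of NumPy. This is the textbook passive-aggressive step (the PA-I variant) for labels written as -1 and +1, shown only to illustrate the idea; it is not a copy of scikit-learn's internal implementation, and the function name pa_update and the cap C are choices made for this example.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One textbook PA-I step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                             # passive: margin already satisfied
    tau = min(C, loss / np.dot(x, x))        # step size, capped at C
    return w + tau * y * x                   # aggressive: move just enough

x = np.array([1.0, 0.0, 1.0])  # a made-up feature vector
w0 = np.zeros(3)

w1 = pa_update(w0, x, +1)  # loss 1.0, tau 0.5 -> weights become [0.5, 0, 0.5]
w2 = pa_update(w1, x, +1)  # margin is now 1.0 -> no change
w3 = pa_update(w1, x, -1)  # same x with the opposite label -> loss 2.0, tau 1.0

print(w1, w2, w3)
```

The second call leaves the weights alone because the example is already on the correct side with margin 1, while the third call pushes the weights far enough to flip the prediction.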
Model 2: Multinomial Naive Bayes
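This is a simple and effective model for text classification. It assumes that features (words) are independent of one another given the class, an assumption that works well in practice for text even though it rarely holds exactly.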
Here is the code:
from sklearn.naive_bayes import MultinomialNB
# Initialize the model
nb = MultinomialNB()
# Train the model
nb.fit(X_train, y_train)
# Predict on the test set
y_pred_nb = nb.predict(X_test)
# Evaluate
print("NB Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))
Output
NB Accuracy: 0.9488641425389756
Confusion Matrix:
[[5529 366]
[ 208 5122]]
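Once a model is trained, classifying a new article means applying the same fitted vectorizer before calling predict. One way to keep the two steps together is scikit-learn's make_pipeline, sketched below on four invented sentences (the texts and labels are made up, and a real article should also go through the same cleaning function first). This is an illustration of the workflow, not the article's original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training texts: 0 = fake, 1 = real
toy_texts = [
    "shocking hoax aliens secretly control government",
    "celebrity hoax shocking miracle cure exposed",
    "senate passes budget after long vote",
    "central bank holds interest rates budget report",
]
toy_labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in one object
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# New, unseen texts are vectorized with the vocabulary learned above
new_articles = ["senate vote on budget report", "shocking hoax about aliens"]
predictions = toy_model.predict(new_articles)

for article, label in zip(new_articles, predictions):
    print("REAL" if label == 1 else "FAKE", "-", article)
```

Words the vectorizer never saw during training (such as "about" here) are ignored, so a prediction rests only on the overlap with the training vocabulary.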
Accuracy alone doesn’t tell the full story. To better understand how each model performed, we will generate classification reports showing precision, recall, and F1-score for each class.
These metrics help us check:
Report for Passive Aggressive Classifier
Let’s check the detailed performance of the Passive Aggressive model.
from sklearn.metrics import classification_report
# Classification report for PAC
print("Classification Report - Passive Aggressive Classifier:\n")
print(classification_report(y_test, y_pred_pac))
Output
Classification Report - Passive Aggressive Classifier:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00      5895
           1       1.00      1.00      1.00      5330
    accuracy                           1.00     11225
   macro avg       1.00      1.00      1.00     11225
weighted avg       1.00      1.00      1.00     11225
Report for Multinomial Naive Bayes
Now we will generate the report for the Naive Bayes model.
# Classification report for Naive Bayes
print("Classification Report - Multinomial Naive Bayes:\n")
print(classification_report(y_test, y_pred_nb))
Output
Classification Report - Multinomial Naive Bayes:
              precision    recall  f1-score   support
           0       0.96      0.94      0.95      5895
           1       0.93      0.96      0.95      5330
    accuracy                           0.95     11225
   macro avg       0.95      0.95      0.95     11225
weighted avg       0.95      0.95      0.95     11225
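The figures in both reports can be recomputed by hand from the confusion matrices printed earlier, which makes the definitions concrete. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so precision divides the diagonal by the column sums and recall divides it by the row sums. The helper name per_class_metrics below is just a choice for this sketch.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows = true, columns = predicted)
cm_pac = np.array([[5871, 24], [23, 5307]])
cm_nb = np.array([[5529, 366], [208, 5122]])

def per_class_metrics(cm):
    correct = np.diag(cm)                   # correctly classified count per class
    precision = correct / cm.sum(axis=0)    # of everything predicted as class k
    recall = correct / cm.sum(axis=1)       # of everything that truly is class k
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct.sum() / cm.sum()
    return precision, recall, f1, accuracy

nb_precision, nb_recall, nb_f1, nb_accuracy = per_class_metrics(cm_nb)
pac_precision, pac_recall, pac_f1, pac_accuracy = per_class_metrics(cm_pac)

print("NB precision:", nb_precision.round(2), "recall:", nb_recall.round(2))
print("NB accuracy:", nb_accuracy)
print("PAC accuracy:", pac_accuracy)
```

This also shows that the 1.00 values in the Passive Aggressive report are rounded: the model still misclassifies 47 of the 11,225 test articles.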
What do these reports tell us?
These reports tell us:
Now that we have seen the performance of both models, let’s quickly compare their results and summarize what we learned from this project.
Metric | Passive Aggressive | Naive Bayes
Accuracy | ~99% | ~95%
Speed | Fast | Very Fast
Best For | Real-time updates | Clean, balanced text data
Precision/Recall | High (both classes) | Slightly lower than PAC
Both classifiers performed well at detecting fake news on this dataset. For accuracy, the Passive Aggressive Classifier is preferable. Naive Bayes is simpler and faster and still performs well, so its real advantage is in quick prototyping.
Throughout this project, you have learned to apply machine learning in fighting misinformation by analyzing news content. You have also learned how to clean text, turn text into features, and evaluate model performance.
Reference:
https://www.kaggle.com/datasets/jainpooja/fake-news-detection
Colab Link -
https://colab.research.google.com/drive/1WuJPnzQFm2kQ5W3y6r1MrWIvhFHA3__U?usp=sharing
408 articles published
Rohan Vats is a Senior Engineering Manager with over a decade of experience in building scalable frontend architectures and leading high-performing engineering teams. Holding a B.Tech in Computer Scie...