Fake News Detection Project Using Python and ML

By Rohan Vats

Updated on Sep 10, 2025 | 10 min read | 23.48K+ views


Ever scrolled through your social media feed and wondered, "Is this story actually true?" In today's world, false information spreads like wildfire, causing confusion and panic. It's a huge problem, but thankfully, technology can help fight back.

This project is all about building a smart solution for fake news detection. We're going to teach a machine learning model to act like a digital fact-checker. It will learn to read an article and, based only on the words it sees, decide whether the news is real or fake.

 


What Should You Know Beforehand?

It helps to have at least some background in:

  • Basic Python programming
  • Fundamental machine learning concepts (e.g., classification and train/test splits)
  • A rough idea of how text data is handled (tokens, stopwords)

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

| Tool/Library | Purpose |
| --- | --- |
| Python | Programming language |
| Google Colab | Environment for writing and running the code |
| Pandas | Data manipulation |
| NumPy | Array operations |
| Matplotlib/Seaborn | Plotting and data visualization |
| Scikit-learn | Machine learning models and evaluation |
| NLTK | Text preprocessing (e.g., stopword removal) |


Models We Will Use

We will use two models that are lightweight yet very effective for binary text classification problems.

  • Passive Aggressive Classifier: Well suited to large-scale learning and online data streams. It updates its weights aggressively only when it misclassifies an example; unlike most learners, it stays passive (makes no update) when it classifies correctly. A minimal sketch of this update rule follows below.
  • Multinomial Naive Bayes: A probabilistic model that assumes words occur independently of one another. It is fast, efficient, and a popular choice for spam and fake news detection.
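
To make the passive/aggressive behavior concrete, here is a minimal sketch of the PA-I update rule on a toy example. The helper function and numbers are ours for illustration only; in the project itself we use scikit-learn's PassiveAggressiveClassifier.

import numpy as np

# Toy sketch of the Passive Aggressive (PA-I) update rule, for intuition only.
def pa_update(w, x, y, C=1.0):
    """One PA-I step. x is a feature vector, y is +1 or -1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss
    if loss == 0.0:
        return w  # passive: correct with enough margin, no update
    tau = min(C, loss / np.dot(x, x))  # aggressive: step size scales with the loss
    return w + tau * y * x

w = np.zeros(2)
w = pa_update(w, np.array([1.0, 2.0]), +1)  # misclassified -> weights move
print(w)  # [0.2 0.4]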


Time Taken and Difficulty

You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to NLP and text classification.

How to Build a Fake News Detection Model

Let’s build the project from scratch. We will proceed by:

  1. Loading the dataset
  2. Cleaning the text
  3. Training our model

Without any further delay, let’s start!

Step 1: Download the Dataset

To train our fake news detection model, we will use the Fake and Real News dataset available on Kaggle. It contains two .csv files: one with real news articles (True.csv) and another with fake news articles (Fake.csv).

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/jainpooja/fake-news-detection
  3. On the Fake News Detection page, in the right pane, under the Data Explorer section, click Fake.csv.
  4. Click the download icon.
  5. Once downloaded, click the True.csv file.
  6. Click the download icon.
  7. Navigate to your downloads folder and extract the files if they were downloaded as a zip.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded both files, upload them to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Note: Depending on your connection, it can take 10-15 minutes for both files to upload.
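
If the browser upload is slow, an alternative (assuming you have already saved the CSVs to your Google Drive) is to mount the drive instead; the folder path below is hypothetical, so adjust it to your own layout:

# Optional alternative: mount Google Drive instead of uploading through the browser
from google.colab import drive
drive.mount('/content/drive')

# Then read the files directly (hypothetical path -- adjust to your Drive):
# df_fake = pd.read_csv('/content/drive/MyDrive/fake-news/Fake.csv')
# df_true = pd.read_csv('/content/drive/MyDrive/fake-news/True.csv')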

Once both files (Fake.csv and True.csv) have been uploaded, load the data using the code below:

import pandas as pd

# Load the fake and real news datasets
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")

# Preview a few rows
df_fake.head()

Output:

| | title | text | subject | date |
| --- | --- | --- | --- | --- |
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |

Step 3: Add Labels and Combine the Datasets

Now that both datasets are loaded, it is time to prepare them for training. We will add a new column called label to each dataset:

  • Assign 0 to all fake news articles (from Fake.csv)
  • Assign 1 to all real news articles (from True.csv)

After that, we combine these two datasets into a single DataFrame. Here is the code that accomplishes this:

# Add labels to each dataset
df_fake['label'] = 0  # 0 = Fake news
df_true['label'] = 1  # 1 = Real news

# Combine the two datasets
df = pd.concat([df_fake, df_true], axis=0)
df = df.reset_index(drop=True)

# Display the first few rows
df.head()

Output:

| | title | text | subject | date | label |
| --- | --- | --- | --- | --- | --- |
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |
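
Note that after concatenation the rows are ordered: all fake articles first, then all real ones. This is fine for our workflow because train_test_split shuffles by default, but if you prefer a shuffled DataFrame up front, one line does it:

# Optional: shuffle the combined DataFrame (train_test_split also shuffles by default)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)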

 

Step 4: Clean the Text Data

Before training our model, we need to clean the news content. Raw text often contains noise such as punctuation, links, numbers, and stopwords that don’t help with prediction.

We will create a function to remove:

  • Punctuation and special characters
  • Numbers and extra spaces
  • Common words such as the, and, is (called stopwords)

Doing so will help the model focus only on meaningful words.

Here is the code:

import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load English stopwords (a set makes membership checks fast)
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>+', '', text)  # Remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # Remove punctuation
    text = re.sub(r'\n', ' ', text)  # Remove newline characters
    text = re.sub(r'\w*\d\w*', '', text)  # Remove words with numbers
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text
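
A quick sanity check on a made-up sentence shows the function in action (the sample text is ours, not from the dataset):

# Quick sanity check on an invented sentence
sample = "BREAKING: You won't BELIEVE this! Read more at https://example.com (100% true)"
print(clean_text(sample))
# Prints roughly: breaking wont believe read true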

We will clean only the text column, since that’s what we will use to train our model. Here is the code to do so:

df['text'] = df['text'].apply(clean_text)

Step 5: Drop Unused Columns and Prepare the Data for Training

Our dataset has extra columns that won’t be used for prediction: title, subject, and date. Since we will work only with the cleaned text and the label, we can safely discard them.

The following is the code to do so:

# Drop the columns we don't need
df = df.drop(['title', 'subject', 'date'], axis=1)

# Check the updated DataFrame
df.head()

Output:

| | text | label |
| --- | --- | --- |
| 0 | donald trump wish americans happy new year lea... | 0 |
| 1 | house intelligence committee chairman devin nu... | 0 |
| 2 | friday revealed former milwaukee sheriff david... | 0 |
| 3 | christmas day donald trump announced would bac... | 0 |
| 4 | pope francis used annual christmas day message... | 0 |

Thus, we have the clean text and the labels. The next step is to convert text to numbers in a way that can be understood by the machine learning model.

Step 6: Convert Text to Vectors (TF-IDF)

Machine learning algorithms cannot operate on raw words; they work with numbers. To convert (vectorize) our cleaned news text into a numerical format, we will use TF-IDF (Term Frequency–Inverse Document Frequency). It gives higher importance to words that are frequent in a document but rare across the rest of the corpus.
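
For reference, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) computes tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the total number of documents, and df(t) is the number of documents containing term t; each document vector is then L2-normalized.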

Here is the code to do so:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Define input (text) and target (label)
X = df['text']
y = df['label']

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the text into TF-IDF features
X_vectorized = vectorizer.fit_transform(X)

# Split the data into training and testing sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.25, random_state=42)
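
One caveat worth knowing: fitting the vectorizer on the entire corpus before splitting lets document-frequency statistics from the test articles leak (mildly) into the features. The tutorial's results use this simpler approach; a stricter variant, sketched below, splits the raw text first and fits TF-IDF on the training portion only. Since it produces the same variable names, the rest of the code works unchanged.

# Stricter alternative: split first, then fit TF-IDF on the training text only
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)  # learn vocabulary + IDF from train
X_test = vectorizer.transform(X_test_text)        # reuse the fitted vocabulary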

Step 7: Train the Model

Now that we have transformed the text into numbers, it’s time to train our machine learning models. We will start with two simple and effective classifiers:

  • Passive Aggressive Classifier
  • Multinomial Naive Bayes

We will fit both models to the training data and then test how well they perform on the test set.

Model 1: Passive Aggressive Classifier

This model updates itself only when it gets a prediction wrong. It’s fast and works well with large datasets.

Here is the code:

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Initialize the model
pac = PassiveAggressiveClassifier()

# Train the model
pac.fit(X_train, y_train)

# Predict on the test set
y_pred_pac = pac.predict(X_test)

# Evaluate
print("PAC Accuracy:", accuracy_score(y_test, y_pred_pac))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_pac))

Output:

PAC Accuracy: 0.9958129175946547
Confusion Matrix:
[[5871   24]
 [  23 5307]]
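
In scikit-learn's confusion matrix, rows are actual classes and columns are predictions, with class 0 (fake) listed first. So out of 11,225 test articles, 5,871 fake articles were correctly flagged, 24 fake articles slipped through as real, 23 real articles were wrongly flagged as fake, and 5,307 real articles were correctly identified.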

Model 2: Multinomial Naive Bayes

This is a simple and effective model for text classification. It works well when features (words) are treated independently.

Here is the code:

from sklearn.naive_bayes import MultinomialNB

# Initialize the model
nb = MultinomialNB()

# Train the model
nb.fit(X_train, y_train)

# Predict on the test set
y_pred_nb = nb.predict(X_test)

# Evaluate
print("NB Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))

Output:

NB Accuracy: 0.9488641425389756
Confusion Matrix:
[[5529  366]
 [ 208 5122]]

Step 8: Generate Classification Reports

Accuracy alone doesn’t tell the full story. To better understand how each model performed, we will generate classification reports showing precision, recall, and F1-score.

These metrics help us check:

  • How many fake/real articles were correctly predicted
  • How often the model makes false predictions

Report for Passive Aggressive Classifier

Let’s check the detailed performance of the Passive Aggressive model.

from sklearn.metrics import classification_report

# Classification report for PAC
print("Classification Report - Passive Aggressive Classifier:\n")
print(classification_report(y_test, y_pred_pac))

Output:

Classification Report - Passive Aggressive Classifier:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5895
           1       1.00      1.00      1.00      5330

    accuracy                           1.00     11225
   macro avg       1.00      1.00      1.00     11225
weighted avg       1.00      1.00      1.00     11225
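
Don't be puzzled that the report shows a perfect 1.00 everywhere while the accuracy above was 0.9958: classification_report rounds to two decimal places, so any value above 0.995 is displayed as 1.00.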

Report for Multinomial Naive Bayes

Now we will generate the report for the Naive Bayes model.

# Classification report for Naive Bayes
print("Classification Report - Multinomial Naive Bayes:\n")
print(classification_report(y_test, y_pred_nb))

Output:

Classification Report - Multinomial Naive Bayes:

              precision    recall  f1-score   support

           0       0.96      0.94      0.95      5895
           1       0.93      0.96      0.95      5330

    accuracy                           0.95     11225
   macro avg       0.95      0.95      0.95     11225
weighted avg       0.95      0.95      0.95     11225

What do these reports tell us?

These reports tell us:

  • Precision: Of the articles predicted as real (or fake), how many actually were
  • Recall: Of all actual real (or fake) articles, how many the model found
  • F1-score: The harmonic mean of precision and recall (worked through below)
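
In formula terms: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × (Precision × Recall) / (Precision + Recall). You can verify them against the Naive Bayes confusion matrix above: for the fake class (0), precision = 5529 / (5529 + 208) ≈ 0.96 and recall = 5529 / (5529 + 366) ≈ 0.94, exactly the values in the report.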

Step 9: Final Comparison and Conclusion

Now that we have seen the performance of both models, let’s quickly compare their results and summarize what we learned from this project.

| Metric | Passive Aggressive | Naive Bayes |
| --- | --- | --- |
| Accuracy | ~99% | ~95% |
| Speed | Fast | Very fast |
| Best for | Real-time updates | Clean, balanced text data |
| Precision/Recall | High (both classes) | Slightly lower than PAC |

Conclusion

This project successfully demonstrates how machine learning can be a powerful tool in the fight against misinformation. Our comparison showed that while Naive Bayes is handy for quick baselines, the Passive Aggressive Classifier is often the stronger choice for a real-world fake news detection system.

Beyond just the models, you've mastered the essential workflow of any NLP project: text preprocessing, feature extraction, and model evaluation. These are the foundational skills needed to build intelligent systems that can help create a more informed and trustworthy online environment.

Reference Links:
https://www.kaggle.com/datasets/jainpooja/fake-news-detection
https://www.knowledgehut.com/tutorials/machine-learning/remove-stop-words-nltk-machine-learning-python

Colab Link:
https://colab.research.google.com/drive/1WuJPnzQFm2kQ5W3y6r1MrWIvhFHA3__U?usp=sharing

Frequently Asked Questions (FAQs)

1. Why was the TfidfVectorizer chosen for this project instead of a simpler method like a CountVectorizer?

The TfidfVectorizer, which stands for Term Frequency-Inverse Document Frequency, was chosen because it provides a more nuanced way to represent text data than a simple word count. While a CountVectorizer would only count the number of times a word appears in an article, TF-IDF goes a step further. It weighs the importance of a word by considering how frequently it appears in a specific document (Term Frequency) and how rare it is across all documents in the dataset (Inverse Document Frequency). This is crucial for fake news detection, as certain words might appear frequently in both real and fake articles, but words that are unique to fake news (like sensational or exaggerated terms) will be given a higher score, making them more significant features for the machine learning model to learn from.

2. The project uses a PassiveAggressiveClassifier. What is this model, and why is it a good choice for this task?

The PassiveAggressiveClassifier is a type of online learning algorithm particularly well-suited for large-scale text classification tasks. It's called "passive" when it encounters a correct classification, meaning it doesn't adjust the model. It becomes "aggressive" when it encounters a misclassification, meaning it updates its weights to correct for the error. This makes it efficient and effective for tasks like news detection, where you might be working with large streams of text data. It offers a great balance between simplicity and high performance, often achieving high accuracy without the computational overhead of more complex models like neural networks.

3. The model achieved a high accuracy score. Does high accuracy alone mean the model is good at fake news detection?

While a high accuracy score is a great starting point, it doesn't tell the whole story, especially for a classification problem. That's why the blog also includes a confusion matrix. The confusion matrix breaks down the results into True Positives, True Negatives, False Positives, and False Negatives. For fake news detection, you want to pay close attention to False Negatives (fake news incorrectly labeled as real) and False Positives (real news incorrectly labeled as fake). A truly effective model will have very low numbers for both of these, ensuring it is not only accurate overall but also reliable in correctly identifying both classes without significant error.

4. Can the model built in this tutorial be used on any news article from the internet? What are its limitations?

The model built in this tutorial is trained specifically on the Fake.csv and True.csv files from the Kaggle dataset. While it has learned patterns from that data, its performance on new, unseen articles from the internet might vary. Its primary limitation is that its knowledge is confined to the vocabulary and writing styles present in its training data. For example, it may struggle with different types of fake news (e.g., satire, clickbait, or propaganda) if they weren't well-represented in the original dataset. For a more robust, real-world news detection system, the model would need to be trained on a much larger and more diverse dataset covering various topics and sources over time.

5. How could this project be improved or taken to the next level?

There are several ways to advance this project. One major improvement would be to use more sophisticated language models like BERT or other transformers instead of TF-IDF, as these models can understand the context and semantic meaning of words, not just their frequency. Another step would be to experiment with different classifiers, such as Logistic Regression, Support Vector Machines, or even a simple neural network, to see if they can outperform the PassiveAggressiveClassifier. Finally, for a more comprehensive solution for fake news detection, you could incorporate additional features beyond just the article text, such as the headline, author, and source of the news, to build an even more accurate and reliable model.
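
For instance, a quick follow-up experiment swapping in Logistic Regression on the TF-IDF features and train/test split from Step 6 might look like this (a sketch, not part of the original notebook):

# Hypothetical experiment: Logistic Regression on the same TF-IDF features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
lr.fit(X_train, y_train)
print("LR Accuracy:", accuracy_score(y_test, lr.predict(X_test)))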
