
Fake News Detection in Machine Learning [Explained with Coding Example]

Last updated: 8th Feb, 2021

Fake news is one of the biggest issues in the current era of the internet and social media. While it’s a blessing that the news flows from one corner of the world to another in a matter of a few hours, it is also painful to see many people and groups spreading fake news.

Machine Learning techniques using Natural Language Processing and Deep Learning can be used to tackle this problem to some extent. We will be building a Fake News Detection model using Machine Learning in this tutorial.

By the end of this article, you will know the following:

  • Handling text data
  • NLP processing techniques
  • Count vectorization & TF-IDF
  • Making predictions and classifying news text

Data & Problem

We will be using the Kaggle Fake News challenge data to make a classifier. The dataset consists of 4 features and 1 binary target. The 4 features are as follows:

  1. id: unique id for a news article
  2. title: the title of a news article
  3. author: author of the news article
  4. text: the text of the article; could be incomplete

The target is "label", which contains binary values: 0 means the article comes from a reliable source (in other words, not fake), and 1 means it is potentially fake news and not reliable. The dataset consists of 20,800 instances. Let's dive right in.

Data Pre-Processing & Cleaning

import pandas as pd

df = pd.read_csv('fake-news/train.csv')
df.head()

X = df.drop('label', axis=1)  # Features
y = df['label']               # Target
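
Before cleaning, it helps to sanity-check the size and class balance described above; a quick check, assuming the same df as loaded here:

## Quick sanity check: row count and label balance (0 = real, 1 = fake)
print(df.shape)                   # expect (20800, 5)
print(df['label'].value_counts())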

We need to drop instances with missing data now. 

df=df.dropna()

As we can see, it dropped all the instances with missing data. 
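
To quantify what the call removed, you can count the missing values per column; a minimal check (run it before and after the dropna call to compare):

## How many values are missing per column, and how many rows remain
print(df.isnull().sum())  # all zeros after dropna
print(df.shape)           # fewer rows than the original 20800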

messages=df.copy()
messages.reset_index(inplace=True)
messages.head(10)

Let’s take a look at the data once.

messages['text'][6]

As we can see, there is a need to do the following steps:

  • Removing stopwords: There are a lot of words that add no value to any text, no matter the data. For example, "I", "a", "am", etc. These words have no informational value and can be removed to reduce the size of our corpus, so that we can focus only on words/tokens of actual value.
  • Stemming the words: Stemming and Lemmatization are techniques to reduce words to their stems or roots, mainly to shrink the vocabulary. For example, words like Play, Playing, and Played are all reduced to "play". Stemming just truncates words to a shorter form and doesn't take the grammatical aspect of the text into consideration. Lemmatization, on the other hand, takes grammar into account and hence produces much better results, but it is usually slower than stemming since it needs to refer to a dictionary. A short comparison follows this list.
  • Removing everything apart from alphabetical values: Non-alphabetical values are not very useful here, so they can be removed. However, you can explore further to see whether the presence of numerical or other types of data has any impact on the target.
  • Lowercasing the words: Lowercasing reduces the vocabulary size, since "Play" and "play" become one token.
  • Tokenizing the sentences: Generating tokens from sentences.
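
To make the stemming vs. lemmatization trade-off concrete, here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer (the word list is illustrative, and the wordnet corpus must be downloaded first):

## Stemming vs. lemmatization on a few illustrative words
import nltk
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
lem = WordNetLemmatizer()
for word in ['playing', 'played', 'plays', 'studies']:
    # e.g., 'studies' -> stem 'studi' (not a word) vs. lemma 'study'
    print(word, '->', ps.stem(word), '|', lem.lemmatize(word, pos='v'))

Now let's apply these steps to our corpus: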
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')  # the stopword list is needed once per environment

ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    # Keep only letters, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])
    review = review.lower()
    review = review.split()
    # Drop English stopwords and stem each remaining token
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

Let’s have a look at our corpus now.

corpus[3]

As we can see, the words are now stemmed to root words.

TF-IDF Vectorizer

Now we need to convert the words into numerical data, a step called vectorization. The easiest way to vectorize is the Bag of Words (BoW) model. But BoW creates a large sparse matrix, so a lot of processing memory is needed. Moreover, BoW weighs every word purely by its raw count, so very frequent but uninformative words dominate the representation, which makes it a weak choice for this task.

TF-IDF (Term Frequency – Inverse Document Frequency) is another way to vectorize words, one that accounts for how widespread a word is across the corpus. For example, common words such as "we", "our", and "the" appear in nearly every document/instance, so their BoW values are high and therefore misleading, which leads to a bad model. TF-IDF is the product of Term Frequency and Inverse Document Frequency.

Term Frequency takes into account the frequency of words in a document and Inverse Document Frequency takes into account the words that are present across the whole corpus. The words that are present across the whole corpus have reduced importance as the IDF value is a lot lower. The words that are present specifically in one document have a high IDF value which makes the total TF-IDF value high. 
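
To make this concrete, here is a tiny worked example on a made-up three-document corpus, using the textbook definition tf * log(N / df). (Note that sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its exact numbers differ.)

## Toy TF-IDF: tf * log(N / df) on a made-up corpus
import math

docs = ['the cat sat', 'the dog sat', 'the cat ran home']
N = len(docs)

for term in ['cat', 'home']:
    df_count = sum(1 for d in docs if term in d.split())  # documents containing the term
    tf = docs[2].split().count(term)                      # frequency in the third document
    print(term, tf * math.log(N / df_count))
# 'cat' appears in 2 of 3 docs  -> weight ~0.41 (common, less informative)
# 'home' appears in only 1 doc  -> weight ~1.10 (rare, more informative)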

## TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = tfidf_v.fit_transform(corpus).toarray()
y = messages['label']

In the above code, we import the TF-IDF Vectorizer from Sklearn's feature extraction module. We create its object by passing max_features=5000 and ngram_range=(1,3). The max_features parameter defines the maximum number of feature vectors we want to create, and the ngram_range parameter defines the n-gram combinations we want to include. In our case, (1,3) gives combinations of 1 word, 2 words, and 3 words: unigrams, bigrams, and trigrams. Let's take a look at some of the features created.

tfidf_v.get_feature_names()[:20]

As we can see, there are multiple types of combinations formed. There are feature names with 1 token, 2 tokens, and also with 3 tokens.
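
To see exactly what ngram_range=(1,3) generates, here is a minimal sketch on a made-up sentence (newer scikit-learn versions name the method get_feature_names_out instead):

## Illustration: unigrams, bigrams, and trigrams from one toy sentence
from sklearn.feature_extraction.text import TfidfVectorizer

toy = TfidfVectorizer(ngram_range=(1, 3))
toy.fit(['the quick brown fox'])
print(toy.get_feature_names())
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown', 'quick brown fox',
#  'the', 'the quick', 'the quick brown']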

Making a Dataframe

## Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

count_df = pd.DataFrame(X_train, columns=tfidf_v.get_feature_names())
count_df.head()

We split the data set into train and test so that we can test the model’s performance on unseen data. We then make a new Dataframe that contains the new feature vectors in it.
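
As an optional refinement (not in the original code), stratifying the split keeps the 0/1 label ratio the same in both halves, which makes accuracy comparisons a little fairer:

## Optional: a stratified split preserves the label proportions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)
print(y_train.value_counts(normalize=True))  # same ratios in train...
print(y_test.value_counts(normalize=True))   # ...and in test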

Modelling & Tuning

MultinomialNB Algorithm

First, we use the Multinomial Naive Bayes algorithm, which is the most common and easiest algorithm preferred for text data classification. We fit it on the training data and predict on the test data. Then we calculate and plot the confusion matrix, and get an accuracy of 88.1%.

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)

# Label 0 is real/reliable and 1 is fake, so the display labels follow that order.
# ConfusionMatrixDisplay replaces the deprecated plot_confusion_matrix helper.
cm = metrics.confusion_matrix(y_test, pred)
metrics.ConfusionMatrixDisplay(cm, display_labels=['REAL', 'FAKE']).plot()
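
Since accuracy alone can hide per-class behavior, it is worth also printing precision and recall for each class; an optional addition using the same predictions as above:

## Optional: per-class precision/recall for the same predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, pred, target_names=['REAL (0)', 'FAKE (1)']))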

Multinomial Classifier with Hyperparameter Tuning

MultinomialNB has a smoothing parameter alpha that can be tuned further. Hence we run a loop to try out multiple MultinomialNB classifiers with different alpha values and check their accuracy scores. If the current score is higher than the best score seen so far, we update the best score and keep that classifier.

previous_score = 0
for alpha in np.arange(0, 1, 0.1):
    sub_classifier = MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train, y_train)
    y_pred = sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    if score > previous_score:
        previous_score = score       # remember the best score seen so far
        classifier = sub_classifier  # keep the best classifier
    print("Alpha: {}, Score : {}".format(alpha, score))

Hence we can see that an alpha value of 0.9 or 0.8 gave the highest accuracy score.
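
The same search can be written more idiomatically with GridSearchCV, which cross-validates on the training data instead of reusing the test set (a sketch, not part of the original tutorial):

## Alternative: cross-validated alpha search on the training set only
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(MultinomialNB(), {'alpha': np.arange(0.1, 1.0, 0.1)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)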

Interpreting the Results

Now let’s see what these classifier coefficient values mean. We’ll first save all the feature names in another variable.

## Get feature names
feature_names = tfidf_v.get_feature_names()

Now, when we sort the values in descending order, the top coefficients are around -4. These denote the words that are most real, or least fake.

### Most real
sorted(zip(classifier.coef_[0], feature_names), reverse=True)[:20]

When we sort the values in ascending order, the smallest coefficients are around -10. These denote the words that are least real, or most fake.

### Most fake
sorted(zip(classifier.coef_[0], feature_names))[:20]
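
One caveat: coef_ has been removed from MultinomialNB in recent scikit-learn releases. The same information is available from feature_log_prob_, whose rows follow the label order (a sketch under that assumption):

## In newer scikit-learn, use feature_log_prob_ instead of coef_
## (row 0 corresponds to label 0/real, row 1 to label 1/fake)
sorted(zip(classifier.feature_log_prob_[1], feature_names), reverse=True)[:20]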

Conclusion

In this tutorial, we used classical ML algorithms only, but you can use neural network methods as well. Moreover, to vectorize the text data, we used the TF-IDF vectorizer. There are more vectorizers, such as Count Vectorizer and Hashing Vectorizer, which may do the job even better. Do try out and experiment with other algorithms and techniques to see if you can produce better results.
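
For instance, swapping in the HashingVectorizer mentioned above only changes the vectorization step; a sketch (note that alternate_sign=False keeps the features non-negative, as MultinomialNB requires, and that hashed features cannot be mapped back to words):

## Sketch: HashingVectorizer as a drop-in alternative to TF-IDF
from sklearn.feature_extraction.text import HashingVectorizer

hash_v = HashingVectorizer(n_features=5000, ngram_range=(1, 3), alternate_sign=False)
X_hash = hash_v.fit_transform(corpus)  # sparse matrix; no vocabulary is stored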

If you're interested in learning more about machine learning, check out IIIT-B & upGrad's Executive PG Programme in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.

Frequently Asked Questions (FAQs)

1. Why is there a need to detect fake news?

In their current condition, social media platforms are highly powerful and valuable, since they allow users to discuss and exchange ideas and debate subjects such as democracy, education, and health. However, certain entities misuse such platforms, in some cases for monetary gain and in others to produce prejudiced viewpoints, alter mindsets, and spread satire or absurdity. This phenomenon is called fake news. The proliferation of online posts that do not adhere to reality has resulted in a slew of issues in politics, sports, health, science, and other fields.

2. Which companies majorly make use of fake news detection?

Fake news detection is used on platforms such as social media and news websites. Social media behemoths like Facebook, Instagram, and Twitter are vulnerable to fake news, since the majority of their users rely on them as daily news sources to get the most up-to-date information. Fake news detection techniques are also used by media companies to determine the authenticity of the information they hold. Email is another medium through which individuals receive news, which makes it difficult to identify and verify its veracity. Hoaxes, spam, and junk mail are well known for being transmitted over email. As a result, the majority of emailing platforms employ fake news detection to identify spam and junk mail.

3. What is Bag of Words or BoW?

Bag-of-words (BoW) is a method of extracting text features for use in modeling, such as with Machine Learning techniques. The method is straightforward and adaptable, and it may be used to extract information from texts in a variety of ways. A bag-of-words is a text representation that specifies the frequency of words appearing in a document. It entails two components: a vocabulary of known terms and a measure of their presence. Because all information about the order or structure of words in the text is discarded, it is referred to as a BAG of words. The model cares about whether or not known terms appear in the document, not where they appear.
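
A minimal sketch of the BoW idea with scikit-learn's CountVectorizer (the two sentences are made up):

## Bag of Words: each document becomes a vector of word counts
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
counts = bow.fit_transform(['the cat sat', 'the cat sat on the cat'])
print(bow.get_feature_names_out())  # vocabulary: ['cat' 'on' 'sat' 'the']
print(counts.toarray())             # word order is discarded, only counts remain
# [[1 0 1 1]
#  [2 1 1 2]]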
