Fake News Detection in Machine Learning [Explained with Coding Example]

Fake news is one of the biggest issues in the current era of the internet and social media. While it’s a blessing that the news flows from one corner of the world to another in a matter of a few hours, it is also painful to see many people and groups spreading fake news.

Machine Learning techniques using Natural Language Processing and Deep Learning can be used to tackle this problem to some extent. We will be building a Fake News Detection model using Machine Learning in this tutorial.

By the end of this article, you will know the following:

  • Handling text data
  • NLP processing techniques
  • Count vectorization & TF-IDF
  • Making predictions and classifying news text

Data & Problem

We will be using the Kaggle Fake News challenge data to make a classifier. The dataset consists of 4 features and 1 binary target. The 4 features are as follows:

  1. id: unique id for a news article
  2. title: the title of a news article
  3. author: author of the news article
  4. text: the text of the article; could be incomplete

The target is “label”, which is binary: 0 means the article comes from a reliable source (in other words, not fake), and 1 means it is potentially fake and not reliable. The dataset consists of 20,800 instances. Let’s dive right in.

Data Pre-Processing & Cleaning

import pandas as pd
df = pd.read_csv('fake-news/train.csv')
df.head()

X = df.drop('label', axis=1)  # Features
y = df['label']               # Target

We need to drop instances with missing data now. 

df=df.dropna()

As we can see, it dropped all the instances with missing data. 

messages=df.copy()
messages.reset_index(inplace=True)
messages.head(10)

Let’s take a look at the data once.

messages['text'][6]

As we can see, there is a need to do the following steps:

  • Removing stopwords: There are a lot of words that add no value to any text no matter the data. For example, “I”, “a”, “am”, etc. These words have no informational value and hence can be removed to reduce the size of our corpus so that we can focus only on words/tokens that are of actual value.
  • Stemming the words: Stemming and Lemmatization are techniques that reduce words to their stems or roots. The main advantage of this step is a smaller vocabulary. For example, words like Play, Playing, Played will be reduced to “play”. Stemming simply strips suffixes by rule and ignores the grammatical role of the word, so the result is not always a dictionary word. Lemmatization, on the other hand, takes grammar into account and hence usually produces cleaner results. However, Lemmatization is usually slower than stemming because it needs to look words up in a dictionary and consider their grammatical role.
  • Removing everything apart from alphabetical values: Non-alphabetical values are not very useful here, so they can be removed. However, you can explore further to see if the presence of numerical or other types of data has any impact on the target.
  • Lower case the words: Lower case the words to reduce vocabulary.
  • Tokenize the sentences: Generating tokens from sentences.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# Requires the NLTK stop-word list: run nltk.download('stopwords') once.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set lookups are fast
corpus = []
for i in range(len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

Let’s have a look at our corpus now.

corpus[3]

As we can see, the words are now stemmed to root words.
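To see what each cleaning step does on its own, here is a minimal sketch using only the standard library. The sample sentence and the tiny stop-word set are made up for illustration (the tutorial itself uses NLTK's much larger English list), and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to", "it"}

def clean_text(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps one at a time and return the remaining tokens."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # drop digits and punctuation
    lowered = letters_only.lower()                  # normalise case
    tokens = lowered.split()                        # whitespace tokenisation
    return [tok for tok in tokens if tok not in stop_words]

sample = "In 2016, the Senate voted 52-48 to confirm it!"
cleaned = clean_text(sample)
print(cleaned)
```

Only the content-bearing words survive; in the full pipeline each of them would then be passed through the stemmer.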

TF-IDF Vectorizer

Now we need to convert the words into numerical data, a step called vectorization. The easiest way to vectorize is Bag of Words (BoW), which counts how often each word occurs in each document. The resulting matrix is large and sparse, so it takes a lot of memory to process. Moreover, BoW looks only at raw counts and ignores how common a word is across the whole corpus, which often makes it a poor choice.

TF-IDF (Term Frequency – Inverse Document Frequency) is another way to vectorize words that also accounts for how widely a word is used across documents. For example, common words such as “we”, “our”, and “the” appear in almost every document/instance, so their BoW values will be high and misleading, which can lead to a bad model. TF-IDF is the product of Term Frequency and Inverse Document Frequency.

Term Frequency takes into account the frequency of words in a document and Inverse Document Frequency takes into account the words that are present across the whole corpus. The words that are present across the whole corpus have reduced importance as the IDF value is a lot lower. The words that are present specifically in one document have a high IDF value which makes the total TF-IDF value high. 
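The idea can be sketched in a few lines of plain Python. This is a minimal illustration using the textbook formula tf × log(N / df) on a made-up three-document corpus; scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, for illustration only.
docs = [
    "the senate passed the bill",
    "the aliens passed the moon",
    "the bill was signed",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Here “the” appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while “senate”, which appears in a single document, gets the largest weight in the first document.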

## TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1,3))
X = tfidf_v.fit_transform(corpus).toarray()
y = messages['label']

In the above code, we import the TF-IDF Vectorizer from Sklearn’s feature extraction module. We create an instance with max_features set to 5000 and ngram_range set to (1,3). The max_features parameter caps the number of features (vocabulary terms) that are kept, and the ngram_range parameter defines which n-gram lengths to include. In our case, we get three kinds of features: single words, 2-word sequences, and 3-word sequences. Let’s take a look at some of the features created.

tfidf_v.get_feature_names_out()[:20]  # use get_feature_names() on scikit-learn older than 1.0

As we can see, there are multiple types of combinations formed. There are feature names with 1 token, 2 tokens, and also with 3 tokens.
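For intuition about where these combinations come from, here is a small sketch of n-gram generation in plain Python. The helper name ngrams and the sample tokens are hypothetical; the vectorizer does this internally and then keeps only the top max_features terms.

```python
def ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous n-grams whose length lies in ngram_range (inclusive)."""
    shortest, longest = ngram_range
    result = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

# Hypothetical stemmed tokens, for illustration only.
tokens = "presid sign execut order".split()
features = ngrams(tokens)
print(features)
```

Four tokens yield four 1-grams, three 2-grams, and two 3-grams, which is why the vocabulary grows quickly and a cap such as max_features is useful.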

Making a Dataframe

## Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

count_df = pd.DataFrame(X_train, columns=tfidf_v.get_feature_names_out())  # get_feature_names() on scikit-learn older than 1.0
count_df.head()

We split the data set into train and test so that we can test the model’s performance on unseen data. We then make a new Dataframe that contains the new feature vectors in it.

Modeling & Tuning

MultinomialNB Algorithm

First, we use Multinomial Naive Bayes, one of the most common and simplest algorithms for text classification. We fit on the training data and predict on the test data. We then calculate and plot the confusion matrix and get an accuracy of 88.1%.

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred)
# Labels are sorted, so index 0 is reliable (REAL) and index 1 is FAKE.
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['REAL', 'FAKE']).plot()
plt.show()
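
To make the relationship between the confusion matrix and the accuracy score concrete, here is a minimal pure-Python sketch. The labels and predictions are made up for illustration; the layout (rows are actual labels, columns are predicted labels) follows the convention scikit-learn uses.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix: rows are actual labels, columns are predicted labels."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

def accuracy(cm):
    """Share of predictions on the diagonal, i.e. correct ones."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical labels: 0 = reliable, 1 = potentially fake.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
cm_toy = confusion_counts(y_true, y_pred)
print(cm_toy, accuracy(cm_toy))
```

The off-diagonal cells show the two kinds of mistakes separately: reliable articles flagged as fake, and fake articles that slipped through.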

Multinomial Classifier with Hyperparameter Tuning

MultinomialNB has a smoothing parameter, alpha, that can be tuned. We run a loop that trains a MultinomialNB classifier for each alpha value and checks its accuracy score. If the current score is higher than the best score so far, we keep that classifier.

previous_score = 0
for alpha in np.arange(0, 1, 0.1):  # alpha=0 disables smoothing and may trigger a warning
    sub_classifier = MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train, y_train)
    y_pred = sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    if score > previous_score:
        classifier = sub_classifier
        previous_score = score  # remember the best score so far
    print("Alpha: {}, Score : {}".format(alpha, score))

Hence we can see that an alpha value of 0.9 or 0.8 gave the highest accuracy score.
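
To see why alpha matters, recall that it is the additive (Laplace/Lidstone) smoothing term in the estimate of P(token | class): (count + alpha) / (class_total + alpha × vocabulary_size). The sketch below uses made-up counts, and the function name smoothed_likelihood is hypothetical.

```python
def smoothed_likelihood(term_count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(term | class)."""
    return (term_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts: a class with 1000 token occurrences and a 5000-term vocabulary.
for alpha in (0.1, 0.5, 1.0):
    unseen = smoothed_likelihood(0, 1000, 5000, alpha)   # term never seen in this class
    seen = smoothed_likelihood(20, 1000, 5000, alpha)    # term seen 20 times
    print("alpha={}: unseen={:.6f}, seen={:.6f}".format(alpha, unseen, seen))
```

A larger alpha moves probability mass from seen terms to unseen ones. With alpha equal to 0, an unseen term gets probability 0, which wipes out the whole product for that class.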

Interpreting the Results

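Now let’s see what the values learned by the classifier mean. We’ll first save all the feature names in another variable.

## Get feature names
feature_names = tfidf_v.get_feature_names_out()

For each token, the classifier stores the log-probability of that token given each class. Row 1 corresponds to label 1 (potentially fake); older scikit-learn versions exposed this row as classifier.coef_[0]. When we sort these values in reverse order, the top values are around -4. These are the tokens the model considers most likely in articles labelled 1, i.e. potentially fake.

### Most likely under label 1 (potentially fake)
sorted(zip(classifier.feature_log_prob_[1], feature_names), reverse=True)[:20]

When we sort the values in non-reverse order, the lowest values are around -10. These are the tokens the model considers least likely in articles labelled 1.

### Least likely under label 1
sorted(zip(classifier.feature_log_prob_[1], feature_names))[:20]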
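
A token that is frequent in both classes scores high under either label, so a per-class ranking alone can be misleading. One possible refinement, sketched here with made-up probabilities and hypothetical token names, is to rank tokens by the difference between their log-probabilities under the two labels.

```python
import math

# Hypothetical tokens and class-conditional probabilities, for illustration only.
feature_names_toy = ["senat", "bill", "share", "shock"]
log_prob_reliable = [math.log(p) for p in (0.40, 0.30, 0.20, 0.10)]
log_prob_fake = [math.log(p) for p in (0.10, 0.20, 0.30, 0.40)]

# Positive ratio: token is more typical of fake articles; negative: of reliable ones.
log_ratio = {
    name: fake - reliable
    for name, reliable, fake in zip(feature_names_toy, log_prob_reliable, log_prob_fake)
}
ranked = sorted(log_ratio, key=log_ratio.get, reverse=True)
print(ranked)
```

With a fitted MultinomialNB, the two lists would come from the rows of its feature_log_prob_ attribute.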

Conclusion

In this tutorial, we used classical ML algorithms only, but you can use neural network methods as well. Moreover, to vectorize the text data, we used the TF-IDF vectorizer. There are other vectorizers as well, such as Count Vectorizer and Hashing Vectorizer, which may do the job better. Do try out and experiment with other algorithms and techniques to see if you can produce better results.

