Fake news is one of the biggest issues in the current era of the internet and social media. While it is a blessing that news can flow from one corner of the world to another in a matter of hours, it is also painful to see many people and groups spreading fake news.
Machine Learning techniques using Natural Language Processing and Deep Learning can be used to tackle this problem to some extent. We will be building a Fake News Detection model using Machine Learning in this tutorial.
By the end of this article, you will know the following:
- Handling text data
- NLP processing techniques
- Count vectorization & TF-IDF
- Making predictions and classifying news text
Data & Problem
We will be using the Kaggle Fake News challenge data to make a classifier. The dataset consists of 4 features and 1 binary target. The 4 features are as follows:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article; could be incomplete
The target is "label", which contains binary values: 0 means the article comes from a reliable source (in other words, not fake), and 1 means the article is potentially fake and not reliable. The dataset consists of 20,800 instances. Let's dive right in.
Data Pre-Processing & Cleaning
import pandas as pd

df = pd.read_csv('fake-news/train.csv')
df.head()
X = df.drop('label', axis=1)  # Features
y = df['label']               # Target
We need to drop instances with missing data now.
df = df.dropna()
As we can see, it dropped all the instances with missing data.
messages = df.copy()
messages.reset_index(inplace=True)
messages.head(10)
Let's take a quick look at one of the articles.
messages['text'][6]
As we can see, there is a need to do the following steps:
- Removing stopwords: Many words add no value to a text regardless of the dataset, for example "I", "a", "am", etc. They carry no informational value and can be removed to reduce the size of our corpus, so that we focus only on words/tokens that actually matter.
- Stemming the words: Stemming and lemmatization are techniques to reduce words to their stems or roots, mainly to shrink the vocabulary. For example, words like Play, Playing, and Played will be reduced to "play". Stemming simply truncates words to a common stem and does not take the grammatical context into account. Lemmatization, on the other hand, does consider grammar and hence produces better results, but it is usually slower than stemming because it has to look words up in a dictionary. A short comparison sketch of the two follows after this list.
- Removing everything apart from alphabetical values: Non-alphabetical characters are not very useful here, so they can be removed. However, you can explore further to see whether the presence of numerical or other types of data has any impact on the target.
- Lower-casing the words: Converting all words to lowercase further reduces the vocabulary.
- Tokenizing the sentences: Splitting sentences into individual tokens.
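Before applying these steps to our data, here is a minimal, standalone sketch comparing stemming and lemmatization on a few example words. This snippet is not part of the tutorial pipeline; it assumes NLTK is installed and that the WordNet data has been downloaded once, e.g. via nltk.download('wordnet'):

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

words = ["play", "playing", "played", "studies"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming truncates to a common stem, sometimes producing non-words
print([stemmer.stem(w) for w in words])                   # e.g. ['play', 'play', 'play', 'studi']

# Lemmatization maps to dictionary words (here treating each word as a verb)
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['play', 'play', 'play', 'study']

Now let's apply the preprocessing steps listed above to our news text: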
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

# Run nltk.download('stopwords') once if the stopword list is not yet available locally.
ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    # Keep only alphabetical characters, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])
    review = review.lower()
    review = review.split()
    # Remove stopwords and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
Let’s have a look at our corpus now.
corpus[3]
As we can see, the words are now stemmed to root words.
TF-IDF Vectorizer
Now we need to convert the words into numerical data, a step known as vectorization. The simplest way to vectorize is the Bag of Words approach. But Bag of Words creates a large, sparse matrix, so a lot of memory is needed to process it. Moreover, BoW only counts how often a word occurs; it does not account for how important a word is relative to the rest of the corpus, which limits its usefulness here.
TF-IDF (Term Frequency – Inverse Document Frequency) is another way to vectorize words that takes this into consideration. For example, common words such as "we", "our", and "the" appear in every document/instance, so their raw counts will be high and therefore misleading, which leads to a bad model. TF-IDF is the product of Term Frequency and Inverse Document Frequency.
Term Frequency accounts for how often a word occurs within a document, while Inverse Document Frequency accounts for how widespread the word is across the whole corpus. Words that are present across the whole corpus get reduced importance, as their IDF value is much lower. Words that appear in only one or a few documents have a high IDF value, which makes their total TF-IDF value high.
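To make this concrete, here is a small, self-contained illustration on a made-up three-document corpus (not the article's dataset); the exact values depend on scikit-learn's smoothing, but the relative ordering shows the idea:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the economy is growing",
    "the election results are fake",
    "the economy and the election",
]

vec = TfidfVectorizer()
vec.fit(docs)

# IDF per term: "the" appears in all three documents, so it gets the lowest IDF;
# terms that appear in only one document, like "fake", get the highest IDF.
# (Use get_feature_names() instead on older scikit-learn versions.)
for term, idf in zip(vec.get_feature_names_out(), vec.idf_):
    print(f"{term}: {idf:.3f}")

Now let's vectorize our own corpus with TF-IDF.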
## TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = tfidf_v.fit_transform(corpus).toarray()
y = messages['label']
In the above code, we import the TF-IDF vectorizer from sklearn's feature extraction module and instantiate it with max_features set to 5000 and ngram_range set to (1, 3). The max_features parameter defines the maximum number of feature vectors we want to create, and the ngram_range parameter defines the n-gram combinations to include; in our case, features are built from combinations of 1, 2, and 3 words. Let's take a look at some of the features created.
tfidf_v.get_feature_names()[:20]
As we can see, there are multiple types of combinations formed. There are feature names with 1 token, 2 tokens, and also with 3 tokens.
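For intuition, here is a tiny, hypothetical example (not taken from our dataset) of what ngram_range=(1, 3) produces for a single short sentence:

from sklearn.feature_extraction.text import TfidfVectorizer

toy = TfidfVectorizer(ngram_range=(1, 3))
toy.fit(["fake news spreads fast"])

# Feature names include unigrams, bigrams, and trigrams, e.g.:
# ['fake', 'fake news', 'fake news spreads', 'fast', 'news', 'news spreads',
#  'news spreads fast', 'spreads', 'spreads fast']
print(toy.get_feature_names_out())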
Making a Dataframe
## Divide the dataset into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

count_df = pd.DataFrame(X_train, columns=tfidf_v.get_feature_names())
count_df.head()
We split the dataset into train and test sets so that we can evaluate the model's performance on unseen data. We then build a new DataFrame that contains the new feature vectors.
Modelling & Tuning
MultinomialNB Algorithm
First, we use the Multinomial Naive Bayes classifier, which is one of the most common and easiest algorithms for text classification. We fit it on the training data and predict on the test data. We then calculate and plot the confusion matrix, and get an accuracy of 88.1%.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)

score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)

# Compute and plot the confusion matrix (label 0 = reliable/REAL, label 1 = FAKE)
cm = metrics.confusion_matrix(y_test, pred)
metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['REAL', 'FAKE']).plot()
Multinomial Classifier with Hyperparameter Tuning
MultinomialNB has a parameter alpha that can be tuned further. So we run a loop over several alpha values, train a MultinomialNB classifier for each, and check its accuracy score. If the current score is higher than the best previous score, we keep that classifier.
previous_score = 0
for alpha in np.arange(0, 1, 0.1):
    sub_classifier = MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train, y_train)
    y_pred = sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    # Keep the classifier with the best score seen so far
    if score > previous_score:
        classifier = sub_classifier
        previous_score = score
    print("Alpha: {}, Score: {}".format(alpha, score))
Hence we can see that an alpha value of 0.9 or 0.8 gave the highest accuracy score.
Interpreting the Results
Now let’s see what these classifier coefficient values mean. We’ll first save all the feature names in another variable.
## Get feature names
feature_names = tfidf_v.get_feature_names()
Now, when we sort the coefficient values in descending order, the top values are around -4. These denote the words that are most real, or least fake.
### Most real
sorted(zip(classifier.coef_[0], feature_names), reverse=True)[:20]
When we sort the values in ascending order, the top values are around -10. These denote the words that are least real, or most fake.
### Most fake
sorted(zip(classifier.coef_[0], feature_names))[:20]
Conclusion
In this tutorial we used classical ML algorithms only, but you can use other approaches, such as neural network methods, as well. Moreover, to vectorize the text data we used the TF-IDF vectorizer; there are other vectorizers, like the Count Vectorizer and Hashing Vectorizer, that may do the job even better. Do try out and experiment with other algorithms and techniques to see whether you can produce better results.
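As one example of such an experiment, here is a brief sketch (assuming the corpus built earlier is still in memory) of swapping in a HashingVectorizer in place of the TF-IDF vectorizer; n_features simply mirrors the max_features value used earlier and is not tuned:

from sklearn.feature_extraction.text import HashingVectorizer

# alternate_sign=False keeps all feature values non-negative,
# which is required if we want to keep using MultinomialNB downstream.
hash_v = HashingVectorizer(n_features=5000, ngram_range=(1, 3), alternate_sign=False)
X_hashed = hash_v.fit_transform(corpus)  # 'corpus' is the cleaned text built earlier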
If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s Executive PG Programme in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.
Why is there a need to detect fake news?
In their current state, social media platforms are highly powerful and valuable since they allow users to discuss and exchange ideas and debate subjects such as democracy, education, and health. However, certain entities misuse such platforms, in some cases for monetary gain and in others to push prejudiced viewpoints, alter mindsets, and spread satire or absurdity. This phenomenon is called fake news. The proliferation of online posts that do not adhere to reality has resulted in a slew of issues in politics, sports, health, science, and other fields.
Which companies majorly make use of fake news detection?
Fake news detection is used on platforms such as social media and news websites. Social media behemoths like Facebook, Instagram, and Twitter are vulnerable to fake news, since the majority of their users rely on them as daily news sources to get the most up-to-date information. Fake news detection techniques are also used by media companies to determine the authenticity of the information they hold. Email is another medium through which people receive news, which makes it difficult to identify and verify its veracity; hoaxes, spam, and junk mail are well known for being transmitted over email. As a result, most emailing platforms employ fake news detection to identify spam and junk mail.
What is Bag of Words or BoW?
Bag-of-words (BoW) is a method of extracting text features for use in modeling, for example with machine learning techniques. The method is straightforward and adaptable, and it can be used to extract information from text in a variety of ways. A bag-of-words is a text representation that records the frequency of words appearing in a document. It entails two components: a vocabulary of known terms and a measure of their presence. Because all information about the order or structure of words in the text is discarded, it is referred to as a "bag" of words: the model cares only about whether known terms appear in the document, not where they appear.
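For illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer on two made-up sentences; each document becomes a vector of word counts and word order is discarded:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the news is fake", "the news is real news"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

print(cv.get_feature_names_out())  # the vocabulary of known terms
print(counts.toarray())            # per-document word counts, e.g. "news" is counted twice in the second document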
