When working on machine learning (ML), deep learning (DL), data mining, Python programming, or natural language processing (NLP), you often need to separate discrete objects into groups based on specific attributes. A classifier is a machine learning model built for exactly that purpose. The Naive Bayes Classifier is the crux of this blog post.
Named after the British mathematician Reverend Thomas Bayes, Bayes' theorem is a mathematical formula for determining conditional probability: the likelihood of an outcome occurring given that another event has already occurred.
Using this formula, we can find the probability of A given that B has occurred:

P(A|B) = P(B|A) × P(A) / P(B)

where:
- A is the proposition;
- B is the evidence;
- P(A) is the prior probability of the proposition;
- P(B) is the prior probability of the evidence;
- P(A|B) is called the posterior; and
- P(B|A) is called the likelihood.

In words: Posterior = (Likelihood × Prior probability of the proposition) / (Prior probability of the evidence).
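As a quick numeric sketch of the formula, here is Bayes' theorem applied to a made-up spam-filtering scenario (all three input probabilities below are invented for illustration):

```python
# Suppose 20% of emails are spam (prior), 60% of spam emails contain
# the word "offer" (likelihood), and 25% of all emails contain it (evidence).
p_spam = 0.20               # P(A): prior probability of the proposition
p_offer_given_spam = 0.60   # P(B|A): likelihood
p_offer = 0.25              # P(B): prior probability of the evidence

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(p_spam_given_offer)  # 0.48
```

So an email containing "offer" would be spam with probability 0.48 under these assumed numbers.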
This formula assumes that the predictors (features) are independent: the presence of one feature does not affect the presence of another. Hence, the method is called 'naive.'
Example Displaying Naïve Bayes Classifier
Let's take an example for a better understanding of the topic.
We will create a classifier that predicts whether a text is about sports or not.
The training data consists of five sentences:
| Sentence | Label |
| --- | --- |
| "A great game" | Sports |
| "The election was over" | Not sports |
| "Very clean match" | Sports |
| "It was a close election" | Not sports |
| "A clean but forgettable game" | Sports |
Now, which label should the sentence "A very close game" receive?
As a classifier, Naive Bayes calculates the probability that "A very close game" is Sports and the probability that it is Not Sports, then picks the larger one.
Mathematically, we want P(Sports | a very close game): the probability that the label is Sports given the sentence "A very close game."
Now, the next step is calculating the probabilities.
But before that, let’s take a look at some concepts.
We first need to determine which features to use when creating a machine learning model. Features are the pieces of information extracted from the text and given to the algorithm.
In the above example, our data is text, so we need to convert it into numbers on which we can perform calculations.
Hence, instead of the raw text, we will use word frequencies: the features are the counts of each word occurring in the text.
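The word counts for the five training sentences can be extracted with a few lines of Python using the standard library's `collections.Counter`:

```python
from collections import Counter

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("It was a close election", "Not sports"),
    ("A clean but forgettable game", "Sports"),
]

# Count how often each word appears under each label
word_counts = {"Sports": Counter(), "Not sports": Counter()}
for sentence, label in training_data:
    word_counts[label].update(sentence.lower().split())

print(word_counts["Sports"]["game"])        # 2
print(sum(word_counts["Sports"].values()))  # 11 words in Sports texts
```

These counts (2 occurrences of "game", 11 words total in Sports texts) are exactly the numbers used in the calculations below.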
Applying Bayes’ Theorem
We will convert the probability we want into something we can calculate from word counts. For this, we will use Bayes' theorem and some basic concepts of probability.
P(A|B) = P(B|A) × P(A) / P(B)

We have P(Sports | a very close game), and by applying Bayes' theorem we can invert the conditional probability:

P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game)
Since the divisor P(a very close game) is the same for both labels, we can discard it and simply compare:

P(a very close game | Sports) × P(Sports)

P(a very close game | Not Sports) × P(Not Sports)
We could try to calculate these probabilities directly: to determine P(a very close game | Sports), count how often the sentence "A very close game" appears under the label Sports and divide by the total.
But "A very close game" does not appear anywhere in the training data, so this probability is zero.
A model that requires every sentence we want to classify to be present in the training data is not of much use.
Naïve Bayes Classifier
Now comes the core part: 'naive.' We assume that every word in a sentence is independent of the others, so we no longer look at entire sentences but at individual words.
P(a very close game) = P(a) × P(very) × P(close) × P(game)

This assumption is strong, but it is also what makes the method practical. The next step is to apply it to what we want to calculate:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)
These individual words do appear in the training data, so now we have something we can actually count.
The final step is to calculate each probability and see which label's score is larger.
First, we calculate the a priori probability of each label from the training data: three of the five sentences are Sports, so P(Sports) = 3/5 and P(Not Sports) = 2/5.
To calculate P(game | Sports), we count how many times the word "game" appears in Sports texts (here 2) and divide by the total number of words in Sports texts (11):

P(game | Sports) = 2/11
But the word "close" is not present in any Sports text!
This means P(close | Sports) = 0, which is a problem because we multiply it with the other probabilities:

P(a | Sports) × P(very | Sports) × 0 × P(game | Sports)

The end result will be 0, and the entire calculation will be nullified. Since this is not what we want, we need a way around it.
We can eliminate this issue with Laplace smoothing: we add 1 to every word count, so that no count is ever zero.
To balance this, we add the number of possible words to the divisor, so that no resulting probability exceeds 1.
In this case, the set of possible words are
[‘a’, ‘great’, ‘very’, ‘over’, ‘it’, ‘but’, ‘game’, ‘match’, ‘clean’, ‘election’, ‘close’, ‘the’, ‘was’, ‘forgettable’].
The number of possible words is 14. Applying Laplace smoothing:

P(game | Sports) = (2 + 1) / (11 + 14) = 3/25
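The smoothed probabilities are easy to compute with a small helper (a minimal sketch using the counts from this example):

```python
# Laplace (add-one) smoothing: add 1 to every word count and add the
# vocabulary size to the divisor so that no probability is ever zero.
VOCAB_SIZE = 14     # number of distinct words in the training data
SPORTS_TOTAL = 11   # total words across all Sports sentences

def smoothed_prob(count, total, vocab_size):
    return (count + 1) / (total + vocab_size)

print(smoothed_prob(2, SPORTS_TOTAL, VOCAB_SIZE))  # P(game|Sports) = 3/25 = 0.12
print(smoothed_prob(0, SPORTS_TOTAL, VOCAB_SIZE))  # P(close|Sports) = 1/25 = 0.04
```

Note that the unseen word "close" now gets a small but non-zero probability instead of wiping out the whole product.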
| Word | P(word \| Sports) | P(word \| Not Sports) |
| --- | --- | --- |
| a | (2 + 1) ÷ (11 + 14) | (1 + 1) ÷ (9 + 14) |
| very | (1 + 1) ÷ (11 + 14) | (0 + 1) ÷ (9 + 14) |
| close | (0 + 1) ÷ (11 + 14) | (1 + 1) ÷ (9 + 14) |
| game | (2 + 1) ÷ (11 + 14) | (0 + 1) ÷ (9 + 14) |
Now, multiplying all the probabilities to find which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports) = 2.76 × 10⁻⁵

P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) × P(Not Sports) = 0.572 × 10⁻⁵
Hence, we finally have our classifier: it gives "A very close game" the label Sports, since that probability is higher, and we infer that the sentence belongs to the Sports category.
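The whole worked example above fits in a short, self-contained Python script. This is a sketch of the same calculation, not a production classifier (real implementations work with log-probabilities to avoid numerical underflow):

```python
from collections import Counter

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("It was a close election", "Not sports"),
    ("A clean but forgettable game", "Sports"),
]

# Per-label word counts and total word counts
counts = {"Sports": Counter(), "Not sports": Counter()}
totals = {"Sports": 0, "Not sports": 0}
for sentence, label in training_data:
    words = sentence.lower().split()
    counts[label].update(words)
    totals[label] += len(words)

vocab = {w for c in counts.values() for w in c}  # 14 distinct words
priors = {label: sum(1 for _, l in training_data if l == label) / len(training_data)
          for label in counts}

def score(sentence, label):
    """Prior times the Laplace-smoothed likelihood of each word."""
    p = priors[label]
    for w in sentence.lower().split():
        p *= (counts[label][w] + 1) / (totals[label] + len(vocab))
    return p

sports = score("A very close game", "Sports")
not_sports = score("A very close game", "Not sports")
print(f"{sports:.3g} vs {not_sports:.3g}")  # roughly 2.76e-05 vs 5.72e-06
```

Running it reproduces the two numbers from the calculation above, and Sports wins.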
Types of Naive Bayes Classifier
Now that we have understood what a Naive Bayes Classifier is and have seen an example, let's look at its types:
1. Multinomial Naive Bayes Classifier
This is mostly used for document classification problems, i.e., determining whether a document belongs to a category such as politics, sports, or technology. The predictors used by this classifier are the frequencies of the words in the document.
2. Bernoulli Naive Bayes Classifier
This is similar to the multinomial Naive Bayes Classifier, but its predictors are boolean variables. The parameters we use to predict the class variable take up the values yes or no only. For instance, whether a word occurs in a text or not.
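To illustrate the difference from the multinomial variant, here is how a sentence from our example would be turned into Bernoulli (presence/absence) features over the 14-word vocabulary:

```python
# Bernoulli predictors: 1 if the word occurs in the text, 0 otherwise
vocab = ["a", "great", "very", "over", "it", "but", "game",
         "match", "clean", "election", "close", "the", "was", "forgettable"]

def bernoulli_features(sentence):
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

print(bernoulli_features("A very close game"))
# [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

A word appearing three times and a word appearing once produce the same feature value here, which is exactly what distinguishes Bernoulli from multinomial Naive Bayes.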
3. Gaussian Naive Bayes Classifier
When the predictors take continuous values rather than discrete ones, we assume that these values are sampled from a Gaussian (normal) distribution.
Since the values in the dataset are continuous, the conditional probability formula changes to the normal density:

P(xᵢ | y) = 1 / √(2πσᵧ²) × exp(−(xᵢ − μᵧ)² / 2σᵧ²)

where μᵧ and σᵧ² are the mean and variance of feature xᵢ among the training samples of class y.
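The Gaussian density used by this variant can be sketched directly in Python; the height numbers below are invented purely for illustration:

```python
import math

# Gaussian Naive Bayes models each continuous feature with a normal density:
# P(x|y) = exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# e.g. likelihood of a height of 172 cm under a class whose training samples
# have mean 170 cm and standard deviation 8 cm (made-up numbers)
print(gaussian_pdf(172.0, 170.0, 8.0))
```

In practice, μ and σ are estimated per feature and per class from the training data, and the resulting densities replace the word-count probabilities in the same Bayes' theorem calculation as before.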
We hope this guide has shown you what the Naive Bayes Classifier is and how it is used to classify text. This simple method works wonders in classification problems, and whether you're a machine learning expert or not, you can build your own Naive Bayes Classifier without spending hours on coding.
If you're interested in learning more, check out upGrad's exclusive programs in machine learning. Give your career a boost with knowledge of machine learning and deep learning. At upGrad Education Pvt. Ltd., we offer a certification program carefully designed and mentored by industry experts.
- This intensive 240+ hours course is specially designed for working professionals.
- You will work on more than five industry projects and case studies.
- You will receive 360-degree career support with a dedicated student success mentor and career mentor.
- You will get assistance for your placement and learn to build a strong resume.