Introduction
What is Bayes Theorem?
Bayes theorem is used to calculate a conditional probability in situations where intuition often fails. Although it originates in probability theory, the theorem is also widely applied in machine learning, where it is used to fit models to training datasets and to develop classification models.
What is conditional probability?
A conditional probability is usually defined as the probability of one event given the occurrence of another event.
 If A and B are two events, the conditional probability may be written as P(A given B) or P(A|B).
 The conditional probability can be calculated from the joint probability: P(A|B) = P(A, B) / P(B)
 The conditional probability is not symmetrical; in general, P(A|B) != P(B|A)
Conditional probability can also be calculated from the other conditional probability, i.e.
P(A|B) = P(B|A) * P(A) / P(B)
The reverse is also used:
P(B|A) = P(A|B) * P(B) / P(A)
This way of calculating is useful when the joint probability is difficult to calculate, or when the reverse conditional probability is already available.
This alternate calculation of conditional probability is referred to as the Bayes Rule or Bayes Theorem. It is named after Reverend Thomas Bayes, who first described it.
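The relationships above can be checked numerically. The sketch below uses a made-up joint distribution over two binary events (the numbers are illustrative assumptions, not taken from this article) to compute both conditional probabilities from the joint probability, and then recovers P(A|B) from the reverse conditional with Bayes rule.

```python
# Hypothetical joint probabilities P(A, B) for two binary events.
# These numbers are illustrative assumptions, not data from the article.
joint = {
    (True, True): 0.10,    # A and B
    (True, False): 0.20,   # A and not B
    (False, True): 0.30,   # not A and B
    (False, False): 0.40,  # not A and not B
}

# Marginal probabilities: sum the joint over the other event.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)

# Conditional probabilities computed from the joint probability.
p_a_given_b = joint[(True, True)] / p_b
p_b_given_a = joint[(True, True)] / p_a

# Bayes rule: recover P(A|B) from the reverse conditional P(B|A).
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(f"P(A|B) = {p_a_given_b:.3f}, P(B|A) = {p_b_given_a:.3f}")
print(f"P(A|B) via Bayes rule = {p_a_given_b_bayes:.3f}")
```

The two routes to P(A|B) agree, while P(A|B) and P(B|A) differ, which is the asymmetry noted above.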
The Formula of Bayes theorem
Bayes theorem is a way of calculating a conditional probability when the joint probability is not available. Sometimes the denominator P(B) cannot be accessed directly. In such cases, it can be calculated as:
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
Substituting this alternate calculation of P(B) into the Bayes theorem, with brackets around the denominator, gives:
P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))
Also, if we have P(A), then P(not A) can be calculated as
P(not A) = 1 - P(A)
Similarly, if we have P(not B|not A), then P(B|not A) can be calculated as
P(B|not A) = 1 - P(not B|not A)
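The expanded form of the theorem can be sketched as a small function. The function name `posterior` and the input values in the usage lines are assumptions chosen for illustration; only the formula itself comes from the text above.

```python
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using the expanded denominator for P(B)."""
    p_not_a = 1.0 - p_a  # P(not A) = 1 - P(A)
    evidence = p_b_given_a * p_a + p_b_given_not_a * p_not_a  # P(B)
    return p_b_given_a * p_a / evidence


# Illustrative inputs (assumed): P(A) = 0.2, P(B|A) = 0.9, P(not B|not A) = 0.7.
p_not_b_given_not_a = 0.7
p_b_given_not_a = 1.0 - p_not_b_given_not_a  # P(B|not A) = 1 - P(not B|not A)

result = posterior(p_a=0.2, p_b_given_a=0.9, p_b_given_not_a=p_b_given_not_a)
print(f"P(A|B) = {result:.4f}")
```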
Bayes Theorem of Conditional Probability
The terms of Bayes Theorem have names that depend on the context in which the equation is applied.
P(A|B) is referred to as the posterior probability, and P(A) as the prior probability.
 P(A|B): Posterior probability.
 P(A): Prior probability.
Similarly, P(B|A) and P(B) are referred to as the likelihood and the evidence.
 P(B|A): Likelihood.
 P(B): Evidence.
Therefore, the Bayes theorem of conditional probability can be restated as:
Posterior = Likelihood * Prior / Evidence
If we have to calculate the probability that there is fire given that there is smoke, then the following equation will be used:
P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
where P(Fire) is the prior, P(Smoke|Fire) is the likelihood, and P(Smoke) is the evidence.
An Illustration of Bayes theorem
The following example illustrates the use of Bayes theorem in a problem.
Problem
There are three boxes, labeled A, B, and C. Details of the boxes are:
 Box A contains 2 red and 3 black balls
 Box B contains 3 red and 1 black ball
 Box C contains 1 red and 4 black balls
The three boxes are identical and have an equal probability of being picked. A box is picked at random and a red ball is drawn from it. What is the probability that the red ball was picked from box A?
Solution
Let E denote the event that a red ball is picked, and let A, B and C denote the events that the ball is picked from the respective box. The conditional probability to be calculated is therefore P(A|E).
The prior probabilities are P(A) = P(B) = P(C) = 1 / 3, since all boxes have an equal probability of being picked.
P(E|A) = Number of red balls in box A / Total number of balls in box A = 2 / 5
Similarly, P(E|B) = 3 / 4 and P(E|C) = 1 / 5
Then the evidence P(E) = P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C)
= (2/5) * (1/3) + (3/4) * (1/3) + (1/5) * (1/3) = 0.45
Therefore, P(A|E) = P(E|A) * P(A) / P(E) = (2/5) * (1/3) / 0.45 = 0.296
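The box calculation can be reproduced with exact fractions. This is a minimal sketch: the variable names are chosen here for illustration, while the probabilities are the ones stated in the problem.

```python
from fractions import Fraction

# Prior probability of picking each box (all equally likely).
priors = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

# Likelihood of drawing a red ball from each box: red balls / total balls.
red_given_box = {"A": Fraction(2, 5), "B": Fraction(3, 4), "C": Fraction(1, 5)}

# Evidence P(E): total probability of drawing a red ball.
evidence = sum(red_given_box[box] * priors[box] for box in priors)

# Posterior P(box|E) for every box.
posteriors = {box: red_given_box[box] * priors[box] / evidence for box in priors}

print("P(E) =", evidence, "=", float(evidence))
for box, p in posteriors.items():
    print(f"P({box}|E) = {p} = {float(p):.3f}")
```

Working with `Fraction` avoids rounding, so the posteriors for the three boxes sum to exactly 1.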
Example of Bayes Theorem
Bayes theorem gives the probability of an “event” given information from “tests”.
 There is a difference between “events” and “tests”. For example, a positive test for liver disease is different from actually having liver disease, which is the event.
 Tests for rare events tend to produce a high proportion of false positives among their positive results.
Example 1
What is the probability of a patient having liver disease if they are an alcoholic?
Here, “being an alcoholic” is the “test” (a kind of litmus test) for liver disease.
 A is the event, i.e. “the patient has liver disease”.
According to the clinic’s earlier records, 10% of the patients entering the clinic suffer from liver disease.
Therefore, P(A)=0.10
 B is the litmus test, i.e. “the patient is an alcoholic”.
Earlier records of the clinic showed that 5% of the patients entering the clinic are alcoholics.
Therefore, P(B)=0.05
 Also, 7% of the patients diagnosed with liver disease are alcoholics. This defines P(B|A): the probability of a patient being an alcoholic, given that they have liver disease, is 7%.
As per the Bayes theorem formula,
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
Therefore, for a patient who is an alcoholic, the chance of having liver disease is 0.14 (14%).
Example 2
 Dangerous fires are rare (1%)
 But smoke is fairly common (10%) due to barbecues,
 And 90% of dangerous fires make smoke
What is the probability of a dangerous fire when there is smoke?
Calculation
P(Fire|Smoke) = P(Fire) * P(Smoke|Fire) / P(Smoke)
= 1% x 90% / 10%
= 9%
Example 3
What is the chance of rain during the day, given a cloudy morning? Here, Rain means rain during the day, and Cloud means a cloudy morning.
The chance of Rain given Cloud is written P(Rain|Cloud)
P(Rain|Cloud) = P(Rain) * P(Cloud|Rain) / P(Cloud)
P(Rain) is the probability of Rain = 10%
P(Cloud|Rain) is the probability of Cloud, given that Rain happens = 50%
P(Cloud) is the probability of Cloud = 40%
P(Rain|Cloud) = 0.1 x 0.5 / 0.4 = 0.125
Therefore, there is a 12.5% chance of rain.
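All three examples apply the same formula, Posterior = Likelihood * Prior / Evidence, with different inputs. A short sketch makes that explicit; the helper name `bayes` is an assumption for illustration, and the numbers are the ones given in the examples.

```python
def bayes(prior, likelihood, evidence):
    """Posterior = Likelihood * Prior / Evidence."""
    return likelihood * prior / evidence


# Example 1: P(liver disease | alcoholic)
liver = bayes(prior=0.10, likelihood=0.07, evidence=0.05)

# Example 2: P(Fire | Smoke)
fire = bayes(prior=0.01, likelihood=0.90, evidence=0.10)

# Example 3: P(Rain | Cloud)
rain = bayes(prior=0.10, likelihood=0.50, evidence=0.40)

for name, value in [("liver disease", liver), ("fire", fire), ("rain", rain)]:
    print(f"{name}: {value:.1%}")
```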
Applications
Bayes theorem has several real-world applications. The main ones are:
1. Modelling Hypotheses
Bayes theorem is widely applied in machine learning, where it establishes a relationship between the data and a model. Applied machine learning involves testing and analysing different hypotheses on a given dataset.
Bayes theorem provides a probabilistic model to describe the relationship between the data (D) and a hypothesis (h):
P(h|D) = P(D|h) * P(h) / P(D)
Where,
P(h|D): Posterior probability of the hypothesis
P(h): Prior probability of the hypothesis.
An increase in P(D) decreases P(h|D). Conversely, if P(h) or P(D|h), the probability of observing the data given the hypothesis, increases, then P(h|D) increases.
2. Bayes Theorem for Classification
Classification involves assigning a label to a given data sample. It can be framed as calculating the conditional probability of a class label given a data sample.
P(class|data) = (P(data|class) * P(class)) / P(data)
Where P(class|data) is the probability of the class given the provided data.
The calculation can be carried out for each class, and the class with the largest probability is assigned to the input data.
In practice, estimating P(data|class) directly is not feasible unless the number of examples is very large, so direct application of the Bayes theorem is not feasible either. The solution for classification lies in a simplified calculation.
In the full calculation, each input variable is allowed to depend on all the other input variables, which makes the calculation complex. The simplification drops this dependence and treats every input variable as independent of the others given the class. The model thus changes from a dependent to an independent conditional probability model, which greatly reduces the complexity.
This simplification of the Bayes theorem is referred to as Naïve Bayes. It is widely used for classification and predictive modelling.
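The simplification can be sketched on a toy problem. The dataset below is invented for illustration, and the add-one (Laplace) smoothing is an extra assumption, not described in this article, that keeps unseen values from producing a zero probability. Each feature contributes an independent P(value|class) factor, which is the Naïve Bayes assumption.

```python
from collections import Counter, defaultdict

# Hypothetical training set of (features, class label); invented for illustration.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
    (("overcast", "cool"), "yes"),
]

class_counts = Counter(label for _, label in train)
# feature_counts[label][position][value] = number of matching examples
feature_counts = defaultdict(lambda: defaultdict(Counter))
values = defaultdict(set)  # distinct values seen at each feature position
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1
        values[i].add(value)


def predict(features):
    """Return P(class|data) for every class under the independence assumption."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(train)  # P(class)
        for i, value in enumerate(features):
            count = feature_counts[label][i][value]
            # P(value|class) with add-one smoothing (an assumption of this sketch)
            score *= (count + 1) / (n_label + len(values[i]))
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(data)
    return {label: score / total for label, score in scores.items()}


probs = predict(("sunny", "hot"))
print(probs, "->", max(probs, key=probs.get))
```

Because P(data) is the same for every class, dividing by `total` only normalises the scores; the class with the largest unnormalised score is already the prediction.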

Bayes Optimal Classifier
This is a probabilistic model that makes the most probable prediction for a new example given the training dataset. It answers the question: “What is the most probable classification of the new instance given the training data?”
The conditional probability of a classification for the new instance given the training data can be calculated through the following equation
P(vj|D) = sum {hi in H} P(vj|hi) * P(hi|D)
Where vj is a candidate classification for the new instance,
H is the set of hypotheses for classifying the instance,
hi is a given hypothesis,
P(vj|hi) is the probability of vj given hypothesis hi, and
P(hi|D) is the posterior probability of the hypothesis hi given the data D.
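The sum over hypotheses can be sketched as follows. The three hypotheses, their posteriors, and their predictions are hypothetical values chosen for illustration; the function name `bayes_optimal` is likewise an assumption.

```python
def bayes_optimal(p_v_given_h, p_h_given_d):
    """Return the most probable label and P(v|D) for every label v."""
    labels = set()
    for distribution in p_v_given_h.values():
        labels.update(distribution)
    # P(v|D) = sum over h of P(v|h) * P(h|D)
    p_v_given_d = {
        v: sum(p_v_given_h[h].get(v, 0.0) * p_h_given_d[h] for h in p_h_given_d)
        for v in sorted(labels)
    }
    best = max(p_v_given_d, key=p_v_given_d.get)
    return best, p_v_given_d


# Hypothetical posteriors P(hi|D) for three hypotheses.
p_h_given_d = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Hypothetical predictions P(vj|hi): h1 says positive, h2 and h3 say negative.
p_v_given_h = {
    "h1": {"positive": 1.0, "negative": 0.0},
    "h2": {"positive": 0.0, "negative": 1.0},
    "h3": {"positive": 0.0, "negative": 1.0},
}

best, p_v_given_d = bayes_optimal(p_v_given_h, p_h_given_d)
print(best, p_v_given_d)
```

In this sketch the single most probable hypothesis (h1) predicts "positive", yet the weighted vote over all hypotheses favours "negative", which is why the Bayes optimal prediction can differ from that of the best individual hypothesis.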
3. Uses of Bayes theorem in Machine learning
The most common application of the Bayes theorem in machine learning is the development of classification models. Applications other than classification include optimization and causal models.

Bayesian optimization
Finding an input that results in the minimum or maximum cost of a given objective function is always a challenging task. Bayesian optimization is based on the Bayes theorem and provides a principled approach to searching for the solution of a global optimization problem. The method involves building a probabilistic model of the objective (the surrogate function), searching it with an acquisition function, and selecting candidate samples for evaluation on the real objective function.
In applied machine learning, Bayesian optimization is used to tune the hyperparameters of a well-performing model.

Bayesian Belief networks
Relationships between variables may be defined through probabilistic models, which are also used to calculate probabilities. A fully conditional probability model may be intractable because of the large volume of data it requires. Naïve Bayes simplifies the calculation by assuming that all variables are conditionally independent. A Bayesian belief network is an intermediate approach: the model is built from the known conditional dependence between some random variables and assumes conditional independence in all other cases. The network displays this dependence and independence as a probabilistic graphical model with directed edges. The known conditional dependencies are displayed as directed edges, and the missing connections represent the conditional independencies in the model.
4. Bayesian Spam filtering
Spam filtering is another application of Bayes theorem. An event and a test are involved:
 Event A: The message is spam.
 Test X: The message contains certain words (X)
With Bayes theorem, the probability that a message is spam can be predicted from the “test results”, i.e. by analysing the words in the message. As the filter is trained on more messages, it updates its estimate of the probability that a message containing certain words is spam.
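A single-word version of this idea can be sketched from message counts. The counts and the word "offer" are hypothetical, invented for illustration; a real filter would combine many words and keep updating the counts as new labelled messages arrive.

```python
# Hypothetical counts from a labelled mailbox (invented for illustration).
spam_total, ham_total = 200, 800          # spam and non-spam messages seen so far
spam_with_word, ham_with_word = 120, 40   # messages containing the word "offer"

p_spam = spam_total / (spam_total + ham_total)    # P(A): message is spam
p_word_given_spam = spam_with_word / spam_total   # P(X|A)
p_word_given_ham = ham_with_word / ham_total      # P(X|not A)

# Evidence P(X): probability that any message contains the word.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(A|X): probability that a message containing the word is spam.
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | contains 'offer') = {p_spam_given_word:.2f}")
```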
An application of Bayes theorem with an example
A catalyst producer produces a device for testing defects in a certain electrocatalyst (EC). The catalyst producer claims that the test is 97% reliable if the EC is defective and 99% reliable when it is flawless. However, 4% of said EC may be expected to be defective upon delivery. Bayes rule is applied to ascertain the true reliability of the device. The basic event sets are
A : the EC is defective; A’: the EC is flawless; B: the EC is tested to be defective; B’: the EC is tested to be flawless.
The probabilities would be
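B|A: the EC is (known to be) defective and tested defective, P(B|A) = 0.97,
B’|A: the EC is (known to be) defective but tested flawless, P(B’|A) = 1 - P(B|A) = 0.03,
B|A’: the EC is (known to be) flawless but tested defective, P(B|A’) = 1 - P(B’|A’) = 0.01,
B’|A’: the EC is (known to be) flawless and tested flawless, P(B’|A’) = 0.99,
and the prior probabilities are P(A) = 0.04 and P(A’) = 1 - P(A) = 0.96.
The probabilities calculated by Bayes theorem are:
P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|A’) * P(A’)) = 0.0388 / 0.0484, which is about 0.80
P(A’|B) = 1 - P(A|B), which is about 0.20
The computation shows that a fairly high share of the ECs rejected by the test are actually flawless (about 20%), and that an EC tested as defective is truly defective with a probability of only about 80%.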
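The figures of about 80% and 20% can be reproduced from the stated reliabilities and the 4% defect rate. This is a minimal sketch; the variable names are chosen here for illustration.

```python
p_defective = 0.04             # P(A): EC is defective on delivery
p_pos_given_defective = 0.97   # P(B|A): defective EC is tested defective
p_neg_given_flawless = 0.99    # P(B'|A'): flawless EC is tested flawless

p_flawless = 1 - p_defective                      # P(A')
p_pos_given_flawless = 1 - p_neg_given_flawless   # P(B|A'): false alarm

# Evidence P(B): probability that an EC is tested defective.
p_pos = p_pos_given_defective * p_defective + p_pos_given_flawless * p_flawless

# Posteriors given a "defective" test result.
p_defective_given_pos = p_pos_given_defective * p_defective / p_pos  # P(A|B)
p_flawless_given_pos = 1 - p_defective_given_pos                     # P(A'|B)

print(f"P(defective | tested defective) = {p_defective_given_pos:.3f}")
print(f"P(flawless  | tested defective) = {p_flawless_given_pos:.3f}")
```

Although the test is 97% to 99% reliable, defects are rare, so false alarms on the many flawless ECs make up roughly a fifth of all rejections.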
Conclusion
One of the most striking features of Bayes theorem is that a large amount of information can be obtained from a few probability ratios. By means of the likelihood, the prior probability of an event is transformed into its posterior probability. The approaches of the Bayes theorem can be applied in areas of statistics, epistemology, and inductive logic.
What is the hypothesis in machine learning?
In the broadest sense, a hypothesis is any idea or proposition that is to be tested; a hypothesis is a guess. Machine learning is the science of making sense of data, especially data that is too complex for humans and is often characterized by seeming randomness. In machine learning, a hypothesis is a candidate model that the machine uses to analyze a data set and capture the patterns that can help us make predictions or decisions. Using machine learning, we are able to make predictions or decisions with the help of algorithms.
What is the most general hypothesis in machine learning?
The most general hypothesis in machine learning is that there is no understanding of the data. Notations and models are just representations of that data, and that data is a complex system, so it is not possible to have a complete and general understanding of it. The only way to learn anything about the data is to use it and see how the predictions change with the data. The general hypothesis is that models are only useful in the domains they have been created to work in, and have no general application to real-world phenomena; the data is unique, and the process of learning is unique to each problem.
Why must a hypothesis be measurable?
A hypothesis is measurable when a number can be assigned to the qualitative or quantitative variable. This can be done by making an observation or by performing an experiment. For example, if a salesman is trying to sell a product, a hypothesis would be to sell the product to a customer. This hypothesis is measurable if the number of sales is measured in a day or week.