Statistics and probability form the core of Machine Learning and Data Science. It is statistical analysis, coupled with computing power and optimization, that makes Machine Learning capable of what it achieves today. From the basics of probability to descriptive and inferential statistics, these topics form the foundation of Machine Learning.

By the end of this tutorial, you will know the following:

- Probability Basics
- Probability Distributions
- Normal Distribution
- Measures of Central Tendency
- Central Limit Theorem
- Standard Deviation & Standard Error
- Skewness & Kurtosis

**Probability Basics**

**Independent and Dependent events**

Consider two events, A and B. When the probability of event A occurring does not depend on whether event B occurs, A and B are independent events. For example, if you toss two fair coins, the probability of getting heads on each coin is 0.5, regardless of how the other coin lands. Hence the two events are independent.

Now consider a box containing 5 balls: 2 black and 3 red. The probability of drawing a black ball first is 2/5. If that ball is black and is not put back, the probability of drawing a black ball again from the remaining 4 balls is 1/4 (it would be 2/4 had the first ball been red). In this case, the two events are dependent, because the probability of drawing a black ball the second time depends on which ball was drawn first.
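The box example can be worked through with exact fractions. This is a minimal sketch that assumes the first ball is not put back; the variable names are purely illustrative.

```python
from fractions import Fraction

# Box from the example: 2 black and 3 red balls (5 in total).
black, red = 2, 3
total = black + red

# First draw: probability of a black ball.
p_black_first = Fraction(black, total)

# Second draw without replacement: the answer depends on the first draw.
p_black_second_given_black = Fraction(black - 1, total - 1)
p_black_second_given_red = Fraction(black, total - 1)

# Probability that both draws are black (multiplication rule).
p_both_black = p_black_first * p_black_second_given_black

print(p_black_first)               # 2/5
print(p_black_second_given_black)  # 1/4
print(p_black_second_given_red)    # 1/2
print(p_both_black)                # 1/10
```

Because the two conditional values differ (1/4 versus 1/2), the second draw is dependent on the first.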

**Marginal Probability**

It’s the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).

**Joint Probability**

It’s the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

**Conditional Probability**

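It’s the probability of one (or more) events given the occurrence of another event. In other words, it is the probability of an event A occurring when a second event B is known to have occurred, e.g. P(A given B) or P(A | B). It is related to the joint and marginal probabilities by P(A | B) = P(A and B) / P(B), provided P(B) is greater than 0.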
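To see how marginal, joint and conditional probabilities fit together, the sketch below enumerates every equally likely way of drawing two balls without replacement from the earlier box of 2 black and 3 red balls. The event names `first_black` and `second_black` and the helper `prob` are illustrative assumptions, not part of any library.

```python
from fractions import Fraction
from itertools import permutations

balls = ["black", "black", "red", "red", "red"]

# Every ordered pair of distinct ball positions is an equally likely outcome.
outcomes = list(permutations(range(len(balls)), 2))

def prob(event):
    """Exact probability of an event, given as a predicate over outcomes."""
    hits = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(hits, len(outcomes))

def first_black(outcome):
    return balls[outcome[0]] == "black"

def second_black(outcome):
    return balls[outcome[1]] == "black"

# Marginal probabilities: each event on its own.
p_a = prob(first_black)
p_b = prob(second_black)

# Joint probability: both events together.
p_a_and_b = prob(lambda o: first_black(o) and second_black(o))

# Conditional probability: P(B | A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b)       # 2/5 2/5
print(p_a_and_b)      # 1/10
print(p_b_given_a)    # 1/4
```

Note that P(A and B) = 1/10 is not equal to P(A) × P(B) = 4/25, which is another way of seeing that the two draws are dependent.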

**Probability Distributions**

A probability distribution describes how likely each possible value in a sample space is. It tells us the probability of obtaining particular values when we sample at random from the population. For example, if a population consists of the marks of students in a school, we can plot marks on the X-axis and the number of students with those marks on the Y-axis. This plot is called a **Histogram**; dividing each count by the total number of students turns it into an estimate of the probability of each mark. Because marks take separate, countable values, this is an example of a **Discrete Probability Distribution**. The main types of discrete distribution are the Binomial, Poisson and discrete Uniform distributions.
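As a small illustration of a discrete distribution, the sketch below computes the Binomial probabilities for the number of heads in 10 tosses of a fair coin. The helper `binomial_pmf` and the choice of 10 tosses are assumptions made for this example only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 tosses of a fair coin

# One probability for each possible number of heads, 0 through 10.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(heads = {k:2d}) = {prob:.4f}")

print(sum(pmf))  # the probabilities of all outcomes add up to 1.0
```

The most likely outcome is 5 heads, with a probability of about 0.246, and the probabilities fall away symmetrically on either side.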

On the other hand, a **Continuous Probability Distribution** describes data that takes continuous values, that is, any value within a range, such as height, speed or temperature. Continuous probability distributions are widely used in Data Science and statistical analysis for tasks such as checking feature importance, examining data distributions and running statistical tests.

**Normal Distribution**

The most well-known continuous distribution is the Normal Distribution, which is also known as the Gaussian distribution or the “Bell Curve.”

Consider a normal distribution of people’s heights. Most heights cluster around the middle, where the curve is tallest. The curve gradually falls away towards the left and right extremes, which indicates a lower probability of drawing those values at random.

This curve is centred at its mean and can be tall and slim or short and spread out. A slim curve means the values are concentrated in a narrow range around the mean, while a more spread-out curve means they cover a larger range. This spread is measured by the **Standard Deviation**.

The greater the standard deviation, the more spread out your data is. The standard deviation is the square root of another property called the variance, which measures how much the data ‘varies’. Variance is what data is all about: variance is information. No variance, no information. The Normal Distribution also plays a crucial role in statistics through the Central Limit Theorem.
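The effect of the standard deviation on the shape of the curve can be checked with Python’s built-in `statistics.NormalDist`. The mean of 170 cm and the standard deviations of 10 cm and 5 cm are hypothetical values chosen for illustration, not measurements.

```python
from statistics import NormalDist

# Two hypothetical height distributions with the same mean (in cm).
wide = NormalDist(mu=170, sigma=10)   # larger standard deviation
narrow = NormalDist(mu=170, sigma=5)  # smaller standard deviation

# The variance is the square of the standard deviation.
print(wide.stdev, wide.variance)  # 10.0 100.0

# A smaller standard deviation gives a taller, slimmer curve at the mean.
print(narrow.pdf(170) > wide.pdf(170))  # True

# Probability of a value falling within 1 and 2 standard deviations of the mean.
within_one_sd = wide.cdf(180) - wide.cdf(160)
within_two_sd = wide.cdf(190) - wide.cdf(150)
print(round(within_one_sd, 4))  # about 0.68
print(round(within_two_sd, 4))  # about 0.95
```

For any normal distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two.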

**Measures of Central Tendency**

Measures of Central Tendency summarize a dataset with a single value. There are three main measures:

**1. Mean:** The mean is the arithmetic mean, or average, of the values in the data/feature: the sum of all values divided by the number of values. The mean is the most common way to measure the centre of data, but it can be misleading in some cases. For example, when there are extreme outliers, the mean is pulled towards them and becomes a poor measure of the centre of your data.

**2. Median**: The median is the data point that lies exactly in the middle when the data is sorted in increasing or decreasing order. When the number of data points is odd, the median is simply the middle point. When the number of data points is even, the median is the mean of the two middle data points.

**3. Mode:** The mode is the value that occurs most frequently in a dataset. The mode is robust to outliers, since a few extreme values do not change which value is the most frequent.
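The three measures can be compared on a small dataset with and without an outlier. The numbers below are made-up values used only to illustrate the behaviour described above.

```python
from statistics import mean, median, mode

# Hypothetical values, e.g. monthly salaries in thousands.
values = [30, 32, 32, 35, 38, 40, 41]
values_with_outlier = values + [500]  # add one extreme value

for label, data in [("original", values), ("with outlier", values_with_outlier)]:
    print(label)
    print("  mean  :", round(mean(data), 2))
    print("  median:", median(data))
    print("  mode  :", mode(data))
```

Adding the single value 500 moves the mean from about 35.43 to 93.5, while the median only shifts from 35 to 36.5 and the mode stays at 32.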

**Central Limit Theorem**

The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the sample mean will approximate a normal distribution regardless of the distribution of the underlying variable, provided that distribution has a finite variance. Let me put the essence of that statement in plain words.

The data might follow any distribution. It could be normal or skewed, it could be exponential or (almost) any distribution you may think of. However, if you repeatedly take samples from the population and keep plotting the histogram of their means, you will eventually find that this new distribution of all the means resembles the Normal Distribution!

In essence, whatever distribution your data follows, the distribution of the sample means will be approximately normal, as long as each sample is large enough.

But how large must each sample be for the CLT to hold? The rule of thumb is a sample size of at least 30. So if each sample contains 30 or more observations, the sample means will be approximately normally distributed for most underlying distributions, although heavily skewed data may need larger samples.
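A quick simulation can make this concrete. The sketch below assumes a skewed exponential population with mean 1, draws many samples of size 30, and looks at how their means are distributed. The constants, the seed and the helper `sample_mean` are choices made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30    # observations per sample (the rule-of-thumb value)
NUM_SAMPLES = 5000  # number of samples drawn from the population

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    draws = [random.expovariate(1.0) for _ in range(n)]
    return sum(draws) / n

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

centre = mean(sample_means)
spread = stdev(sample_means)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
share_within_one_sd = (
    sum(abs(m - centre) <= spread for m in sample_means) / NUM_SAMPLES
)

print("mean of sample means:", round(centre, 3))   # close to the population mean, 1
print("sd of sample means  :", round(spread, 3))   # close to 1 / sqrt(30)
print("1 / sqrt(30)        :", round(1 / sqrt(SAMPLE_SIZE), 3))
print("share within 1 sd   :", round(share_within_one_sd, 3))
```

Even though the exponential population is strongly skewed, the sample means cluster symmetrically around 1, and the share within one standard deviation comes out near the 68% expected of a normal distribution.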

**Standard Deviation & Standard Error**

Standard Deviation and Standard Error are often confused with one another. Standard Deviation, as you might know, quantifies how far the data points spread out around the mean, on both sides of it. If your data points are spread across a large range of values, the standard deviation will be high.

Now, as we discussed above, by the Central Limit Theorem, if we plot the means of all the samples from a population, the distribution of those means will be approximately normal. So it will have its own standard deviation, right?

The standard deviation of the means of all samples of a given size from a population is called the Standard Error (of the mean). For samples of size n, it equals the population standard deviation divided by √n, so it is smaller than the Standard Deviation whenever n is greater than 1. Means are less spread out than individual data points because averaging smooths out extreme values.

The same idea applies to other statistics: you can calculate the standard error of the median, the mode or even the standard deviation itself!
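The difference between standard deviation and standard error can be checked with a simulation. This sketch assumes a normal population with mean 50 and standard deviation 10 and samples of size 25. All constants, the seed and the helper `draw_sample` are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, median, stdev

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 50.0, 10.0  # hypothetical population parameters
SAMPLE_SIZE = 25
NUM_SAMPLES = 4000

def draw_sample(n):
    """Draw n values from the assumed normal population."""
    return [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# Formula: standard error of the mean = population sd / sqrt(sample size).
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)  # 10 / 5 = 2.0

samples = [draw_sample(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# Empirical standard errors: the sd of a statistic across many samples.
se_of_mean = stdev([mean(s) for s in samples])
se_of_median = stdev([median(s) for s in samples])

# In practice only one sample is available, so the standard error is
# estimated from that sample's own standard deviation.
estimated_se = stdev(samples[0]) / sqrt(SAMPLE_SIZE)

print("theoretical SE of mean :", theoretical_se)
print("simulated SE of mean   :", round(se_of_mean, 3))
print("simulated SE of median :", round(se_of_median, 3))
print("estimate from 1 sample :", round(estimated_se, 3))
```

The simulated standard error of the mean lands close to 2, far below the population standard deviation of 10, and the median turns out to have a somewhat larger standard error than the mean for this population.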

**Before You Go**

Statistical concepts form the real core of Data Science and ML. To be able to make valid deductions and understand the data at hand effectively, you need to have a solid understanding of the statistical and probability concepts discussed in this tutorial.