
Statistics for Machine Learning: Everything You Need to Know

Last updated: 22nd Jun, 2023 | Read Time: 9 Mins

Statistics and Probability form the core of Machine Learning and Data Science. It is statistical analysis, coupled with computing power and optimization, that makes Machine Learning capable of achieving what it is achieving today. From the basics of probability to descriptive and inferential statistics, these topics form the foundation of Machine Learning.


By the end of this tutorial, you will know the following:

  • Probability Basics
  • Probability Distributions
  • Normal Distribution
  • Measures of Central Tendency
  • Central Limit Theorem
  • Standard Deviation & Standard Error
  • Skewness & Kurtosis

Probability Basics

Independent and Dependent events


Let’s consider two events, event A and event B. When the probability of occurrence of event A doesn’t depend on the occurrence of event B, A and B are independent events. For example, if you toss two fair coins, the probability of getting heads is 0.5 for each coin regardless of what the other shows. Hence the events are independent.


Now consider a box containing 5 balls — 2 black and 3 red. The probability of drawing a black ball first will be 2/5. Now the probability of drawing a black ball again from the remaining 4 balls will be 1/4. In this case, the two events are dependent as the probability of drawing a black ball for the second time depends on what ball was drawn on the first go.
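The ball-drawing example above can be worked out exactly with Python’s fractions module, which keeps the probabilities as exact ratios rather than floating-point approximations:

```python
from fractions import Fraction

# Box with 2 black and 3 red balls, drawn without replacement.
p_first_black = Fraction(2, 5)         # 2 black out of 5 balls
p_second_given_first = Fraction(1, 4)  # 1 black left out of 4 balls

# Probability that both draws are black (dependent events):
p_both_black = p_first_black * p_second_given_first
print(p_both_black)  # 1/10
```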

Marginal Probability

It’s the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).

Joint Probability

It’s the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

Conditional Probability

It’s the probability of one (or more) events given the occurrence of another event; in other words, it is the probability of an event A occurring when a related event B is known to have occurred, e.g. P(A given B) or P(A | B).
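The three kinds of probability, and the identity P(A | B) = P(A and B) / P(B) that connects them, can be checked exhaustively on a small sample space. The sketch below uses two fair dice, with the illustrative events A = “first die shows 6” and B = “the sum is at least 10”:

```python
from itertools import product

# Sample space: two fair dice, 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Marginal probability of B, joint probability of A and B,
# and the conditional probability of A given B.
p_B = sum(1 for a, b in outcomes if a + b >= 10) / 36
p_A_and_B = sum(1 for a, b in outcomes if a == 6 and a + b >= 10) / 36
p_A_given_B = p_A_and_B / p_B

print(p_B, p_A_and_B, p_A_given_B)  # 1/6, 1/12, 0.5
```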


Probability Distributions

Probability Distributions depict how data points are distributed over a sample space. They let us see the probability of drawing certain data points when sampling at random from the population. For example, if a population consists of the marks of students in a school, then the probability distribution will have Marks on the X-axis and the number of students with those marks on the Y-axis. This is also called a Histogram, which is a type of Discrete Probability Distribution. The main types of Discrete Distribution are the Binomial Distribution, the Poisson Distribution and the Uniform Distribution.

On the other hand, a Continuous Probability Distribution is made for data that takes continuous values, i.e., when the variable can take an infinite set of values, like height, speed or temperature. Continuous Probability Distributions have tremendous use in Data Science and statistical analysis for checking feature importance, data distributions, statistical tests, etc.

In addition to the previously stated discrete probability distributions (binomial, Poisson, and uniform), a few more significant discrete probability distributions are often employed in statistics for machine learning.

The Bernoulli distribution is a discrete probability distribution that describes a binary outcome: the random variable has only two possible values, usually labeled 0 and 1. It is typically used to model the probability of success or failure in a single trial.

The geometric distribution models the number of trials needed to obtain the first success in a sequence of independent Bernoulli trials, each with the same probability of success.

The negative binomial distribution models the total number of trials required to reach a specified number of successes in a sequence of independent Bernoulli trials. It generalizes the geometric distribution by allowing more than one success.
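These three distributions can be simulated with nothing more than repeated Bernoulli trials. The sketch below, using an illustrative success probability p = 0.3, builds geometric and negative binomial samples on top of a Bernoulli trial and checks the simulated means against the theoretical values 1/p and r/p:

```python
import random

random.seed(42)
p = 0.3  # illustrative success probability for each Bernoulli trial

def bernoulli():
    """One trial: 1 (success) with probability p, else 0 (failure)."""
    return 1 if random.random() < p else 0

def geometric():
    """Number of trials up to and including the first success."""
    trials = 1
    while bernoulli() == 0:
        trials += 1
    return trials

def negative_binomial(r):
    """Number of trials up to and including the r-th success."""
    trials = successes = 0
    while successes < r:
        trials += 1
        successes += bernoulli()
    return trials

samples = [geometric() for _ in range(100_000)]
nb_samples = [negative_binomial(3) for _ in range(20_000)]

# Simulated means should be close to the theoretical 1/p and r/p.
print(sum(samples) / len(samples))        # ≈ 3.33
print(sum(nb_samples) / len(nb_samples))  # ≈ 10.0
```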


Normal Distribution

The most well-known continuous distribution is Normal Distribution, which is also known as the Gaussian distribution or the “Bell Curve.”

Consider a normal distribution of the heights of people. Most of the heights are clustered in the middle, where the curve is tallest, and the frequency gradually falls towards the left and right extremes, which denote a lower probability of sampling those values at random.

This curve is centred at its mean, and it can be tall and slim or short and spread out. A slim curve denotes that the values are concentrated in a narrow range, while a more spread-out curve shows that there is a larger range of values. This spread is defined by the Standard Deviation.

The greater the Standard Deviation, the more spread out your data will be. Standard Deviation is just a mathematical derivation of another property called the Variance, which defines how much the data ‘varies’. And variance is what data is all about: variance is information; no variance, no information. The Normal Distribution also plays a crucial role in statistics via the Central Limit Theorem.

It is important to mention that the normal distribution is essential to statistical learning in AI. Many machine learning methods assume, or attempt to approximate, a normal distribution.

The 68-95-99.7 rule, commonly known as the empirical rule or the three-sigma rule, is an essential characteristic of the normal distribution. According to the rule, around 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. This rule is a valuable guideline for understanding how the data is spread and for spotting outliers.
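The empirical rule is easy to verify by simulation. The sketch below draws 100,000 points from a normal distribution with an illustrative mean of 100 and standard deviation of 15, and counts the fraction falling within one, two, and three standard deviations of the mean:

```python
import random

random.seed(0)
mu, sigma = 100, 15  # illustrative parameters

data = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of points within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)

print(within(1), within(2), within(3))  # ≈ 0.68, 0.95, 0.997
```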

Measures of Central Tendency

Measures of Central Tendency are ways to summarize a dataset with a single value. There are mainly 3 Measures of Central Tendency:

1. Mean: The mean is just the arithmetic mean or the average of the values in the data/feature. Sum of all values divided by the number of values gives us the mean. Mean is usually the most common way to measure the centre of any data, but can be misleading in some cases. For example, when there are a lot of outliers, the mean will start to shift towards the outliers and be a bad measure of the centre of your data.

2. Median: Median is the data point that lies exactly in the centre when the data is sorted in increasing or decreasing order. When the number of data points is odd, then the median is easily picked as the centre most point. When the number of data points is even, then the median is calculated as the mean of the 2 centre most data points. 

3. Mode: Mode is the data point that occurs most frequently in a dataset. The mode is the most robust to outliers, as it stays fixed at the most frequent point.

In addition to the mean, median, and mode, there are other measures of central tendency that can give further insight into the data.

4. Weighted Mean: The weighted mean is used when distinct data points have different weights or relevance. It is calculated by multiplying each value by its corresponding weight and dividing the sum of these weighted values by the sum of the weights.

5. Trimmed Mean: The trimmed mean is a variation of the mean that reduces the impact of outliers. A fixed percentage of the highest and lowest values is removed before computing the mean of the remaining numbers. The trimmed mean is useful when the data contains extreme outliers that would greatly distort the ordinary mean.
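All five measures can be computed with Python’s built-in statistics module. The sketch below uses a small illustrative dataset whose last value, 100, is an outlier; note how the median and mode stay put while the mean is pulled upwards:

```python
import statistics

data = [2, 3, 3, 5, 7, 9, 100]  # 100 is an outlier

print(statistics.mean(data))    # ≈ 18.43, pulled up by the outlier
print(statistics.median(data))  # 5, robust to the outlier
print(statistics.mode(data))    # 3, the most frequent value

# Weighted mean: weights reflect the relevance of each point.
values, weights = [80, 90, 70], [0.5, 0.3, 0.2]
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(weighted_mean)  # 81.0

# Trimmed mean: drop the single lowest and highest value, then average.
trimmed_mean = statistics.mean(sorted(data)[1:-1])
print(trimmed_mean)  # 5.4
```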

Central Limit Theorem

The Central Limit Theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean will approximate a normal distribution regardless of the variable’s underlying distribution. Let me put the essence of the above statement in plain words.

The data might follow any distribution. It could be a perfect or a skewed normal, it could be exponential, or (almost) any distribution you may think of. However, if you repeatedly take samples from the population and keep plotting the histogram of their means, you will eventually find that this new distribution of all the means resembles the Normal Distribution!

In essence, it doesn’t matter what distribution your data is in, the distribution of their means will always be normal.

But how large do the samples need to be for the CLT to hold? The rule of thumb says the sample size should be greater than 30. So if you take samples of 30 or more observations from any distribution, the means will be approximately normally distributed, no matter the type of the underlying distribution.
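A quick simulation makes this concrete. The sketch below builds a heavily skewed (exponential) population, draws many samples of size 30, and records each sample’s mean; the mean of those sample means lands very close to the population mean of roughly 1.0, and a histogram of them would look bell-shaped:

```python
import random
import statistics

random.seed(1)

# A heavily skewed population: exponential with mean 1.0 (far from normal).
population = [random.expovariate(1.0) for _ in range(100_000)]

# Take many samples of size 30 and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(5_000)]

# Despite the skewed population, the sample means cluster
# around the population mean.
print(statistics.mean(sample_means))  # ≈ 1.0
```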

The Central Limit Theorem has major ramifications for hypothesis testing and parameter estimation. Many statistical tests and estimation procedures are based on the assumption of a normally distributed sampling distribution, which is frequently justified thanks to the Central Limit Theorem. Based on sample statistics, we can therefore make reasonable inferences about the parameters of the population.

Standard Deviation & Standard Error

Standard Deviation and Standard Error are often confused with one another. Standard Deviation, as you might know, describes or quantifies the variation in the data on both sides of the distribution – lower than mean and greater than mean. If your data points are spread across a large range of values, the standard deviation will be high.

Now, as we discussed above, by Central Limit Theorem, if we plot the means of all the samples from a population, the distribution of those means will again be a normal distribution. So it will have its own standard deviation, right?

The standard deviation of the means of all samples from a population is called the Standard Error. The value of the Standard Error will usually be less than the Standard Deviation, as you are calculating the standard deviation of means, and means are less spread out than individual data points due to aggregation.

You can even calculate the standard deviation of medians, of modes, or even the standard deviation of standard deviations!
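The relationship between the two can also be checked by simulation. The sketch below, using an illustrative population with mean 50 and standard deviation 10, computes the standard deviation of individual points and the standard deviation of many sample means (the Standard Error); the latter comes out close to SD divided by the square root of the sample size:

```python
import random
import statistics

random.seed(7)

# Illustrative population: mean 50, standard deviation 10.
population = [random.gauss(50, 10) for _ in range(100_000)]
sd = statistics.pstdev(population)  # spread of individual data points

# Standard Error: the standard deviation of many sample means (n = 30).
n = 30
means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]
se = statistics.pstdev(means)

print(sd, se)  # SE is much smaller, close to sd / sqrt(n)
```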



Before You Go

Statistical concepts form the real core of Data Science and ML. To be able to make valid deductions and understand the data at hand effectively, you need to have a solid understanding of the statistical and probability concepts discussed in this tutorial.

upGrad provides an Executive PG Programme in Machine Learning & AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. These courses explain the need for Machine Learning and the further steps to gather knowledge in this domain, covering varied concepts ranging from Gradient Descent to Machine Learning.


Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.


Frequently Asked Questions (FAQs)

1. Is knowledge of statistics mandatory for doing well in machine learning?

Statistics is a very vast field. In machine learning, statistics basically help in understanding the data deeply. Some statistical concepts like probability, data interpretation, etc. are needed in several machine learning algorithms. However, you do not have to be an expert on all the topics of statistics to do well in machine learning. By knowing just the fundamental concepts, you will be able to perform efficiently.

2. Will knowing some coding beforehand be helpful in machine learning?

Coding is the heart of machine learning, and programmers who understand how to code well will have a deep understanding of how the algorithms function and, thus, will be able to monitor and optimize those algorithms more effectively. You do not need to be an expert in any programming language, although any prior knowledge will be beneficial. If you are a beginner, Python is a good choice since it is simple to learn and has a user-friendly syntax.

3. How do we use calculus in everyday life?

Weather forecasts are based on a number of variables, such as wind speed, moisture content, and temperature, which can only be modelled using calculus. Calculus is also applied in aviation engineering in a variety of ways. Vehicle manufacturers use calculus to improve and ensure the safety of their vehicles, and credit card companies use it in payment calculations.
