Hypothesis Testing is a broad subject that is applicable to many fields. When we study statistics, the Hypothesis Testing there involves data from multiple populations and the test is to see how significant the effect is on the population.
This involves calculating the p-value and comparing it with the critical value or the alpha. When it comes to Machine Learning, Hypothesis Testing deals with finding the function that best approximates independent features to the target. In other words, map the inputs to the outputs.
By the end of this tutorial, you will know the following:
- What is Hypothesis in Statistics vs Machine Learning
- What is Hypothesis space?
- Process of Forming a Hypothesis
Hypothesis in Statistics
A Hypothesis is an assumption of a result that is falsifiable, meaning it can be proven wrong by some evidence. A Hypothesis can be either rejected or failed to be rejected. We never accept any hypothesis in statistics because it is all about probabilities and we are never 100% certain. Before the start of the experiment, we define two hypotheses:
1. Null Hypothesis: says that there is no significant effect
2. Alternative Hypothesis: says that there is some significant effect
In statistics, we compare the P-value (which is calculated using different types of statistical tests) with the critical value or alpha. The larger the P-value, the higher is the likelihood, which in turn signifies that the effect is not significant and we conclude that we fail to reject the null hypothesis.
In other words, the effect is highly likely to have occurred by chance and there is no statistical significance of it. On the other hand, if we get a P-value very small, it means that the likelihood is small. That means the probability of the event occurring by chance is very low.
Join the ML and AI Course online from the World’s top Universities – Masters, Executive Post Graduate Programs, and Advanced Certificate Program in ML & AI to fast-track your career.
The Significance Level is set before starting the experiment. This defines how much is the tolerance of error and at which level can the effect can be considered significant. A common value for significance level is 95% which also means that there is a 5% chance of us getting fooled by the test and making an error. In other words, the critical value is 0.05 which acts as a threshold. Similarly, if the significance level was set at 99%, it would mean a critical value of 0.01%.
A statistical test is carried out on the population and sample to find out the P-value which then is compared with the critical value. If the P-value comes out to be less than the critical value, then we can conclude that the effect is significant and hence reject the Null Hypothesis (that said there is no significant effect). If P-Value comes out to be more than the critical value, we can conclude that there is no significant effect and hence fail to reject the Null Hypothesis.
Now, as we can never be 100% sure, there is always a chance of our tests being correct but the results being misleading. This means that either we reject the null when it is actually not wrong. It can also mean that we don’t reject the null when it is actually false. These are type 1 and type 2 errors of Hypothesis Testing.
Consider you’re working for a vaccine manufacturer and your team develops the vaccine for Covid-19. To prove the efficacy of this vaccine, it needs to statistically proven that it is effective on humans. Therefore, we take two groups of people of equal size and properties. We give the vaccine to group A and we give a placebo to group B. We carry out analysis to see how many people in group A got infected and how many in group B got infected.
We test this multiple times to see if group A developed any significant immunity against Covid-19 or not. We calculate the P-value for all these tests and conclude that P-values are always less than the critical value. Hence, we can safely reject the null hypothesis and conclude there is indeed a significant effect.
Hypothesis in Machine Learning
Hypothesis in Machine Learning is used when in a Supervised Machine Learning, we need to find the function that best maps input to output. This can also be called function approximation because we are approximating a target function that best maps feature to the target.
1. Hypothesis(h): A Hypothesis can be a single model that maps features to the target, however, may be the result/metrics. A hypothesis is signified by “h”.
2. Hypothesis Space(H): A Hypothesis space is a complete range of models and their possible parameters that can be used to model the data. It is signified by “H”. In other words, the Hypothesis is a subset of Hypothesis Space.
Process of Forming a Hypothesis
In essence, we have the training data (independent features and the target) and a target function that maps features to the target. These are then run on different types of algorithms using different types of configuration of their hyperparameter space to check which configuration produces the best results. The training data is used to formulate and find the best hypothesis from the hypothesis space. The test data is used to validate or verify the results produced by the hypothesis.
Consider an example where we have a dataset of 10000 instances with 10 features and one target. The target is binary, which means it is a binary classification problem. Now, say, we model this data using Logistic Regression and get an accuracy of 78%. We can draw the regression line which separates both the classes. This is a Hypothesis(h). Then we test this hypothesis on test data and get a score of 74%.
Now, again assume we fit a RandomForests model on the same data and get an accuracy score of 85%. This is a good improvement over Logistic Regression already. Now we decide to tune the hyperparameters of RandomForests to get a better score on the same data. We do a grid search and run multiple RandomForest models on the data and check their performance. In this step, we are essentially searching the Hypothesis Space(H) to find a better function. After completing the grid search, we get the best score of 89% and we end the search.
FYI: Free nlp course!
Now we also try more models like XGBoost, Support Vector Machine and Naive Bayes theorem to test their performances on the same data. We then pick the best performing model and test it on the test data to validate its performance and get a score of 87%.
Checkout: Machine Learning Projects & Topics
Before you go
The hypothesis is a crucial aspect of Machine Learning and Data Science. It is present in all the domains of analytics and is the deciding factor of whether a change should be introduced or not. Be it pharma, software, sales, etc. A Hypothesis covers the complete training dataset to check the performance of the models from the Hypothesis space.
A Hypothesis must be falsifiable, which means that it must be possible to test and prove it wrong if the results go against it. The process of searching for the best configuration of the model is time-consuming when a lot of different configurations need to be verified. There are ways to speed up this process as well by using techniques like Random Search of hyperparameters.
If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s Executive PG Programme in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.
Why should we do open-source projects?
There are many reasons to do open-source projects. You are learning new things, you are helping others, you are networking with others, you are creating a reputation and many more. Open source is fun, and eventually you will get something back. One of the most important reasons is that it builds a portfolio of great work that you can present to companies and get hired. Open-source projects are a wonderful way to learn new things. You could be enhancing your knowledge of software development or you could be learning a new skill. There is no better way to learn than to teach.
Can I contribute to open source as a beginner?
Yes. Open-source projects do not discriminate. The open-source communities are made of people who love to write code. There is always a place for a newbie. You will learn a lot and also have the chance to participate in a variety of open-source projects. You will learn what works and what doesn't and you will also have the chance to make your code used by a large community of developers. There is a list of open-source projects that are always looking for new contributors.
How does GitHub projects work?
GitHub offers developers a way to manage projects and collaborate with each other. It also serves as a sort of resume for developers, with a project's contributors, documentation, and releases listed. Contributions to a project show potential employers that you have the skills and motivation to work in a team. Projects are often more than code, so GitHub has a way that you can structure your project just like you would structure a website. You can manage your website with a branch. A branch is like an experiment or a copy of your website. When you want to experiment with a new feature or fix something, you make a branch and experiment there. If the experiment is successful, you can merge the branch back into the original website.