Hypothesis testing is a crucial part of any statistical analysis. However, several parameters must be fixed in advance so that the test we conduct is as reliable as possible. This is where the concept of power comes into play and shapes the heuristics of a statistical test.
By the end of this tutorial, you will know:
- Heuristics of Statistical Tests
- What is the Power of a test?
- What is the need for Power Analysis?
- How to carry out Power Analysis
Heuristics of Statistical Tests
Carrying out a correct statistical test depends on several heuristics that need to be set before conducting the test. It is important to set them correctly, because they cannot be changed once the test has started. Let’s have a look at a few of these.
1. Significance Level and Confidence Interval
Before starting any statistical test, a probability threshold needs to be set. This threshold is called the significance level (alpha). The cut-off value of the test statistic that corresponds to alpha is called the Critical Value, and the region under the probability curve beyond the Critical Value, whose area equals alpha, is called the Critical Region.
The alpha value tells us how far the sample result (or the experimental point) must be from the null hypothesis value (the original mean) before we conclude that it is unusual enough to reject the null hypothesis. A commonly used value of alpha is 0.05, which corresponds to a 95% confidence level.
2. P-Value
To evaluate whether the test results are statistically significant, we compare the significance level (alpha) that we set before the test with the P-Value of the test. The p-value is the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one we actually observed. If the p-value is less than alpha, we reject the null hypothesis.
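The comparison between the p-value and alpha can be sketched in a few lines of Python. This is only an illustration: it assumes a two-sided z-test, and the observed statistic `z_observed` is a made-up number rather than a result from real data.

```python
from statistics import NormalDist

def two_sided_p_value(z: float) -> float:
    """Probability, under the null, of a z statistic at least as extreme as `z`."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05        # significance level, chosen before the test
z_observed = 2.1    # hypothetical test statistic, for illustration only

p_value = two_sided_p_value(z_observed)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

The decision rule is the last comparison: the null hypothesis is rejected only when the p-value falls below the alpha that was fixed beforehand.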
3. Type 1 & Type 2 Errors
Statistical tests can never be 100% certain. There is always room for error and for being misled by the results. As discussed above, an alpha value of 0.05 corresponds to a 95% confidence level. This means that even when the null hypothesis is true, there is a 5% chance that the test result is misleading. These incorrect results are what we call errors. There are 2 types of error: Type 1 and Type 2.
A significance level of 0.05 means that, when the null hypothesis is true, your statistical test will correctly fail to reject it 95% of the time. It also means that there is a 5% chance of rejecting the null hypothesis when it was actually correct. This is a Type 1 Error, so alpha (α) is the probability of committing a Type 1 error.
The opposite can also happen: you fail to reject the null hypothesis when it is actually false. This is what we call a Type 2 Error. (Technically, we can never accept the null hypothesis. We can only fail to reject it.) The probability of making a Type 2 error is denoted by beta (β).
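A small simulation can make the meaning of alpha concrete. The sketch below assumes a one-sample z-test with a known standard deviation of 1 and draws every sample from a population where the null hypothesis is true, so every rejection is a Type 1 error. The function names, seed and trial count are illustrative choices.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type1_error_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Fraction of tests that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # The null is true here: the population mean really is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / trials

rate = type1_error_rate()
print(f"Estimated Type 1 error rate: {rate:.3f}")
```

The estimated rate should land close to the chosen alpha, which is what "alpha is the probability of a Type 1 error" means in practice.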
What is the Power of a Statistical Test?
The Power of a test is the probability of correctly rejecting the Null Hypothesis when it is false. In other words, Power is the complement of the probability of making a Type 2 error: Power = 1-β. For example, a power of 80% means that if a real effect exists, the test has an 80% chance of detecting it. Therefore, the higher the power, the lower the probability of committing a Type 2 error.
But why can the results be bogus? Because we are dealing with random samples. Sometimes the sample that is drawn lies far from the mean of the distribution and gives unrepresentative results, leading us to incorrect decisions. The aim of Power Analysis is to reduce the chance of making these incorrect decisions.
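To see how Power = 1-β behaves as the sample grows, here is a rough sketch. It assumes a two-sided, two-sample z-test with equal group sizes and uses the normal approximation; `d` stands for the difference between the two group means measured in standard deviations. The function name and the example values are assumptions made for illustration.

```python
from statistics import NormalDist

def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test (equal group sizes)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)   # cut-off implied by alpha
    shift = d * (n_per_group / 2) ** 0.5   # how far the true effect moves the statistic
    # Probability that the statistic lands in either tail of the critical region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

for n in (10, 30, 64, 100):
    power = power_two_sample(d=0.5, n_per_group=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

When `d` is 0 the null hypothesis is true and the function returns alpha, which ties the Type 1 error rate and power together in one formula.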
Are we P-Hacking?
Let’s take an example where we have made a vaccine for COVID-19 and we are quite sure that the vaccine will show significant results. We proceed to conduct a statistical test to see whether our belief holds statistically as well. So we set alpha to 0.05 and carry out a test using 100 samples.
After the test, we get a P-value of 0.06. It is close to our alpha, but not below it, so we cannot reject the null hypothesis. It is tempting to see what happens if we increase the number of samples and redo the test.
So we add 50 more samples and see that the P-Value is now 0.045. Did we just prove our vaccine to be statistically significant? NO! We just P-hacked, because we increased the number of samples after seeing the first result.
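A minimal simulation sketch of why adding samples after seeing the first p-value is a problem. It assumes a one-sample z-test with a known standard deviation of 1 and data with no real effect at all, so every significant result is a false positive. The sample sizes mirror the 100 and 50 used above; the function names, seed and trial count are illustrative.

```python
import random
from statistics import NormalDist, mean

def p_value(sample):
    """Two-sided one-sample z-test p-value against mean 0 with known sigma = 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(alpha=0.05, n_first=100, n_extra=50, trials=4000, seed=1):
    """Compare a fixed-size test with 'test, then add samples and test again'."""
    rng = random.Random(seed)
    fixed = peeking = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_first)]  # no real effect
        significant = p_value(sample) < alpha
        fixed += significant
        if not significant:
            # The p-hacking step: collect more data only because the first test failed.
            sample += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            significant = p_value(sample) < alpha
        peeking += significant
    return fixed / trials, peeking / trials

fixed_rate, peeking_rate = false_positive_rates()
print(f"Fixed sample size:    {fixed_rate:.3f}")
print(f"Add samples and redo: {peeking_rate:.3f}")
```

The second rate comes out higher than the first because the procedure gets two chances to cross the threshold, which is why the sample size has to be fixed before the test starts.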
What is Power Analysis?
In the above example, we found that the sample size was small and increased it afterwards. This is wrong and should never be done. The sample size should be fixed before starting the test. But what sample size is right for us?
Let’s consider an example where we carry out multiple tests with a sample size of just 1. When we sample 1 data point randomly from the population, it can either lie near the mean, and so represent our data well, or lie far away from the mean and represent the data poorly.
The issue arises when we conduct statistical tests using these far-off data points: the P-value that we get will be misleading. We now conduct another series of tests with a sample size of 2. Now even if one value is far from the mean, the other value is likely to lie closer to it or on the other side of the distribution and will pull their average towards the centre, reducing the effect of the far-off value. Therefore, with a sample size of 2, our results will be more reliable, with more trustworthy P-Values.
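The pull towards the centre can be checked with a quick simulation. The sketch below assumes a standard normal population (mean 0, standard deviation 1) and measures how much the sample mean varies from one sample to the next for different sample sizes. The function name, seed and trial count are illustrative.

```python
import random
from statistics import mean, stdev

def spread_of_sample_means(n, trials=5000, seed=2):
    """Standard deviation of the sample mean across many samples of size n."""
    rng = random.Random(seed)
    means = [mean([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(trials)]
    return stdev(means)

for n in (1, 2, 10, 30):
    print(f"sample size {n:2d}: spread of the sample mean = {spread_of_sample_means(n):.3f}")
```

The spread shrinks roughly in proportion to one over the square root of the sample size, so even going from 1 to 2 data points makes far-off averages noticeably less likely.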
Power Analysis is the technique used to find the sample size needed to conduct a test with the desired power. The higher the Power we need, the larger the sample size required. So you might ask: why not just take a large sample, since a larger sample means better and more trustworthy results? The answer is that collecting data is costly, so it is essential to know how large a sample is actually required.
How to carry out Power Analysis?
The power of a test depends on several factors. The first step in a power analysis is to set a Power value. Suppose you set a common power of 0.8, meaning that you want at least an 80% chance of correctly rejecting the null hypothesis when it is false. If we are validating the effect of a COVID-19 vaccine on a set of people, we want to show that the distribution of data points for vaccinated people is different from that of people who were given a placebo.
1. Amount of overlap
We need to consider the amount of overlap between the two distributions we are comparing. The greater the overlap, the more difficult it is to safely reject the null, and hence the larger the sample size we need. If the overlap is small, we can reject the null quite easily and need a much smaller sample. The overlap depends on the distance between the means of the two distributions and on their standard deviations.
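As a rough illustration of how the means and standard deviations drive the overlap, the sketch below assumes both groups are normally distributed and uses the `overlap` method of `NormalDist` from Python's standard library, which returns the shared area under the two curves as a number between 0 and 1. The group means and standard deviations are invented for the example.

```python
from statistics import NormalDist

# Hypothetical placebo group: mean 10, standard deviation 2.
placebo = NormalDist(mu=10, sigma=2)

# Two hypothetical vaccinated groups, one close to the placebo and one far from it.
small_shift = NormalDist(mu=11, sigma=2)
large_shift = NormalDist(mu=14, sigma=2)

print(f"Overlap with a small shift in the mean: {placebo.overlap(small_shift):.2f}")
print(f"Overlap with a large shift in the mean: {placebo.overlap(large_shift):.2f}")
```

Moving the means further apart, or shrinking the standard deviations, reduces the overlap and makes the difference easier to detect.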
2. Effect size
Effect size is a way to combine the difference between the means and the standard deviations of the populations into a single number. Effect size (d) is calculated as the estimated difference between the means divided by the pooled estimated standard deviation. One of the simplest ways to calculate the pooled estimated standard deviation is to take the square root of the sum of the squared standard deviations divided by 2, that is, sqrt((s1² + s2²) / 2).
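The effect size calculation described above can be sketched as follows. The two lists of measurements are invented purely for illustration, and the function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def pooled_sd(sd_a: float, sd_b: float) -> float:
    """Simple pooled standard deviation: sqrt((s1^2 + s2^2) / 2)."""
    return sqrt((sd_a ** 2 + sd_b ** 2) / 2)

def effect_size(group_a, group_b) -> float:
    """Effect size d: difference between the means over the pooled standard deviation."""
    return (mean(group_a) - mean(group_b)) / pooled_sd(stdev(group_a), stdev(group_b))

# Hypothetical measurements, for illustration only.
vaccinated = [12.1, 11.4, 13.0, 12.6, 11.9]
placebo = [10.2, 11.0, 10.5, 9.8, 10.9]

print(f"Effect size d = {effect_size(vaccinated, placebo):.2f}")
```

This simple pooling treats both groups equally, which is reasonable when the group sizes are the same.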
So once we have the Power value, the alpha value and the effect size, we can plug these values into a statistical power calculator and get the required sample size. Such calculators are easily available on the internet.
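For readers who would like to see roughly what such a calculator does, here is a minimal sketch. It assumes a two-sided comparison of two equal-sized groups and uses the normal approximation, so dedicated calculators based on the t-distribution may return a slightly larger number. The function name and the example effect sizes are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples needed in each group (normal approximation)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # cut-off for a two-sided test at level alpha
    z_power = norm.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {sample_size_per_group(d)} samples per group")
```

The required sample size rises sharply as the effect size shrinks, which is why a realistic estimate of the effect size matters so much in power analysis.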
Before you go
We calculated the sample size by carrying out Power Analysis using Power, alpha and effect size. So if we get a sample size of 7, it means that we need 7 samples to have an 80% chance of correctly rejecting the Null Hypothesis. Having the right domain expertise is also crucial for estimating the population means, their overlap and the power required.
What is Power Analysis?
The power of a test is the probability of correctly rejecting the Null Hypothesis when it is false. In other words, Power is the complement of the probability of making a Type 2 error: Power = 1-β. For example, a power of 80% means that if a real effect exists, the test has an 80% chance of detecting it. Therefore, the higher the power, the lower the probability of committing a Type 2 error. Power Analysis is the technique of choosing a sample size that achieves the desired power. It helps prevent wrong decisions, because we are working with random samples and there is always a chance that a sample gives an unrepresentative mean and leads us to an incorrect conclusion.
What factors are considered while carrying out Power Analysis?
Several factors affect a power analysis. The very first step is to set the power value. Suppose we choose a power of 0.7, which implies a 70% chance of correctly rejecting the null hypothesis when it is false. The factors that affect Power Analysis are as follows. The amount of overlap is the overlap between the two distributions being compared. The smaller the overlap the better, since the greater the overlap, the harder it is to reject the null. Effect size is a way to combine the difference between the means and the standard deviations of the populations. It is denoted by “d” and is calculated as the estimated difference between the means divided by the pooled estimated standard deviation. Once we have the power value, the alpha value and the effect size, we can carry out the Power Analysis.
What is P-Hacking?
P-Hacking, or data dredging, is the misuse of data analysis techniques to find patterns in data that appear significant but are not. It harms a study because it presents patterns as significant when they are not, which can lead to a drastic increase in the number of false positives. P-hacking cannot be prevented completely, but there are methods that can reduce it and help avoid the trap.