Chi-Square Test Explained: Easy Guide with Examples
By Rohit Sharma
Updated on Jul 22, 2025 | 14 min read | 2.26K+ views
Did you know? The finance and banking sector dominates the data analytics world, contributing 37% of total revenue! Marketing and e-commerce follow with 26% and 15%, respectively, with tools like the Chi-Square Test driving key insights across these industries.
The Chi-Square Test is a statistical method used to assess relationships between categorical variables. It determines whether observed frequencies align with expected frequencies under a null hypothesis.
This test is widely used in hypothesis testing, data analysis, and pattern recognition. Real-world applications include market research, medical diagnostics, and quality control.
In this blog, you’ll learn about the concept of the Chi-Square Test, its types, assumptions, and formulas. We will also provide a step-by-step example to help you understand how the Chi-Square test works.
The Chi-Square Test (χ²) is a statistical method for analyzing categorical data. It compares the observed results with the results expected if there were no relationship between the variables. The test applies to data divided into distinct groups, such as gender, product type, or location. The formula for the chi-square test is:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Where:
Oᵢ = observed frequency in category (or cell) i
Eᵢ = expected frequency in category (or cell) i under the null hypothesis
Σ = sum over all categories (or cells)
Now, let’s explore the two main types of Chi-Square tests, which help analyze relationships between variables or check data distribution.
The Chi-Square Test of Independence is used to determine if two categorical variables are independent or related. This test checks if the distribution of one variable is influenced by the other. It's commonly used in surveys and experiments to understand relationships between variables.
Purpose: Checks if two variables are related or independent.
Example:
The Chi-Square Goodness of Fit Test checks whether the observed sample data matches an expected population distribution. It is used when you have one categorical variable and want to determine if the data's distribution fits a specific theoretical pattern.
Purpose: Checks if sample data matches a specific population distribution.
Example:
Let’s now understand the key conditions and assumptions that must be met for the Chi-Square Test to produce valid and reliable results.
The Chi-Square Test relies on several critical assumptions to ensure its validity. Violating these assumptions can lead to misleading results. These assumptions are rooted in the mathematical properties of categorical data and are essential for accurate p-value computation.
Here’s a detailed overview of each assumption:
1. Independence of Observations
Observations in the dataset must be independent of each other. This means that one observation does not influence another and that each subject is counted in only one cell. The test assumes that data points are randomly sampled, without dependency such as repeated measurements of the same subject.
2. Categorical Data
The Chi-Square Test applies only to categorical data. This data is divided into categories, such as “yes/no” or “male/female.” Ordinal data, where categories have a natural order, can also be used, but proper handling is required.
3. Expected Frequency Assumption
Each expected frequency must be at least 5 for the Chi-Square approximation to hold. This ensures the test statistic follows the chi-squared distribution closely. If any expected frequency is less than 5, the asymptotic approximation may no longer be valid.
4. Sample Size
A larger sample size ensures the Chi-Square distribution is approximated correctly. A small sample can distort the distribution of the test statistic. Larger samples provide more reliable p-values.
5. No Cell Frequency Less Than 5
The expected frequency in each contingency table cell should be five or greater. Cells with expected counts below 5 can make the chi-squared approximation, and therefore the p-value, unreliable. This assumption is critical for the validity of the results.
6. Adequate Number of Categories
The number of categories in each group should be sufficient to detect meaningful differences. With fewer categories, the test may lose statistical power, making it more difficult to identify relationships between variables.
7. Chi-Square Distribution
The Chi-Square statistic follows a Chi-Square distribution with degrees of freedom (df) calculated based on the number of categories. This distribution is central to hypothesis testing and the interpretation of the p-value. With df = 1 (for example, a 2x2 table) and a small sample, the approximation is coarser, which is why a continuity correction such as Yates' correction is sometimes applied.
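The expected-frequency assumption can be checked before running the test. The sketch below is a minimal illustration using only the Python standard library; the helper names `expected_counts` and `low_expected_cells` and the `small_table` data are hypothetical and chosen for this example.

```python
def expected_counts(observed):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def low_expected_cells(observed, minimum=5):
    """Return (row, col, expected) for each cell whose expected count is below `minimum`."""
    expected = expected_counts(observed)
    return [
        (i, j, e)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < minimum
    ]


# Hypothetical survey in which the second group has very few respondents.
small_table = [[12, 9, 14], [3, 2, 1]]
flagged = low_expected_cells(small_table)

for row, col, value in flagged:
    print(f"cell ({row}, {col}) has expected count {value:.2f}")
```

If any cells are flagged, the usual options are to collect more data, merge sparse categories, or switch to an exact test.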
Let’s now explore the specific steps involved in performing the chi-square test, ensuring you understand the process in detail.
Performing a chi-squared test requires a systematic approach to determine if a significant relationship exists between categorical variables. It involves collecting observed data, calculating expected frequencies, and comparing them using the Chi-Square formula to assess the significance.
Here is a step-by-step guide to performing a Chi-Square Test:
Step 1: State the Hypothesis
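In any statistical test, clearly defining the hypotheses is crucial for drawing a conclusion:
Null hypothesis (H₀): The two categorical variables are independent, i.e., there is no association between them.
Alternative hypothesis (H₁): The two categorical variables are associated.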
Step 2: Collect Data and Construct a Contingency Table
The data should be organized into a contingency table (also called a cross-tabulation or crosstab), where each cell represents the frequency of occurrences of specific combinations of variables. This table helps visualize the relationship between two categorical variables.
Example: If we are analyzing the effect of a new marketing campaign on customer preferences, the contingency table might look like:
Groups | Product A | Product B | Product C | Total |
Group 1 | 50 | 30 | 20 | 100 |
Group 2 | 40 | 40 | 20 | 100 |
Total | 90 | 70 | 40 | 200 |
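A contingency table like the one above is usually built by counting raw records. The following is a minimal sketch using the standard library; the `records` list is hypothetical and is constructed here only so that its counts match the table.

```python
from collections import Counter

# Hypothetical raw data: one (group, product) pair per customer,
# built so that the counts reproduce the table above.
records = (
    [("Group 1", "Product A")] * 50
    + [("Group 1", "Product B")] * 30
    + [("Group 1", "Product C")] * 20
    + [("Group 2", "Product A")] * 40
    + [("Group 2", "Product B")] * 40
    + [("Group 2", "Product C")] * 20
)

groups = ["Group 1", "Group 2"]
products = ["Product A", "Product B", "Product C"]

# Count each combination, then arrange the counts as rows (groups) by columns (products).
counts = Counter(records)
table = [[counts[(g, p)] for p in products] for g in groups]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

print(table, row_totals, col_totals, grand_total)
```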
Step 3: Calculate the Expected Frequencies
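Expected frequencies are the frequencies you would expect if the null hypothesis were true, i.e., if the variables were independent. The formula for calculating the expected frequency (Eᵢ) for each cell is:
$$E_i = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$
Where:
row total = sum of the observed counts in the row containing cell i
column total = sum of the observed counts in the column containing cell i
grand total = total number of observations in the table
For example, for the Group 1, Product A cell in the contingency table above, the expected frequency would be calculated as:
$$E = \frac{100 \times 90}{200} = 45$$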
The expected values for all cells are calculated in a similar manner. These values are used to compare against the observed frequencies.
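As a sketch of this step, the expected frequency of every cell in the Step 2 table can be computed from the row totals, column totals, and grand total. The variable names are illustrative only.

```python
# Observed counts from the Step 2 table (rows: Group 1, Group 2; columns: Products A, B, C).
observed = [[50, 30, 20], [40, 40, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total * column total) / grand total, for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for name, row in zip(["Group 1", "Group 2"], expected):
    print(name, row)
```

Because both groups have the same row total (100), the expected counts are identical for the two rows: 45, 35, and 20.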
Step 4: Calculate the Chi-Square Statistic
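The Chi-Square statistic is calculated by summing, over all cells, the squared difference between the observed and expected frequency divided by the expected frequency. The formula is:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Where:
Oᵢ = observed frequency in cell i
Eᵢ = expected frequency in cell i
The calculation is done for each cell in the contingency table, and the results are summed to obtain the overall chi-squared statistic.
For example, continuing from the earlier table, where the expected counts are 45, 35, and 20 for Products A, B, and C in each group:
$$\chi^2 = \frac{(50-45)^2}{45} + \frac{(30-35)^2}{35} + \frac{(20-20)^2}{20} + \frac{(40-45)^2}{45} + \frac{(40-35)^2}{35} + \frac{(20-20)^2}{20} \approx 2.54$$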
This will give the overall Chi-Square value, which is then used to compare against the critical value from the Chi-Square distribution.
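The same summation can be sketched in a few lines of Python. The function name `chi_square_statistic` is hypothetical; the observed counts are those of the Step 2 table, and the expected counts follow from its row and column totals.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells of two equally shaped tables."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


observed = [[50, 30, 20], [40, 40, 20]]
expected = [[45.0, 35.0, 20.0], [45.0, 35.0, 20.0]]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 2))
```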
Step 5: Determine the Degrees of Freedom
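The degrees of freedom (df) for a Chi-Square Test of Independence are calculated as:
$$df = (r - 1)(c - 1)$$
Where:
r = number of rows in the contingency table
c = number of columns in the contingency table
The degrees of freedom indicate how many independent comparisons you can make between categories.
For example, in a 2x2 table (two rows and two columns):
$$df = (2 - 1)(2 - 1) = 1$$
For the 2x3 table in Step 2 (two groups and three products), df = (2 - 1)(3 - 1) = 2.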
This degree of freedom is used to find the critical value from the Chi-Square distribution table.
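Since miscounting rows or columns is an easy slip, a small helper can make the calculation explicit. This is a sketch with hypothetical function names; the goodness-of-fit version assumes no distribution parameters were estimated from the data.

```python
def df_independence(n_rows, n_cols):
    """Degrees of freedom for a test of independence: (rows - 1) * (columns - 1).

    Count only the category rows and columns, not the 'Total' margins.
    """
    return (n_rows - 1) * (n_cols - 1)


def df_goodness_of_fit(n_categories):
    """Degrees of freedom for a goodness-of-fit test: categories - 1.

    Assumes no parameters of the expected distribution were estimated from the sample.
    """
    return n_categories - 1


print(df_independence(2, 2))   # 2x2 table
print(df_independence(2, 3))   # two groups, three products
print(df_goodness_of_fit(5))   # five categories
```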
Step 6: Compare the Chi-Square Statistic with the Critical Value
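Using the Chi-Square distribution table, find the critical value for the chosen significance level (usually α = 0.05) and the calculated degrees of freedom. The critical value is the threshold beyond which you would reject the null hypothesis:
If χ² is greater than the critical value, reject the null hypothesis.
If χ² is less than or equal to the critical value, fail to reject the null hypothesis.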
This determines whether the observed differences between categories are statistically significant or due to chance.
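An equivalent way to make this decision is to compare a p-value with α. The sketch below uses closed-form expressions for the upper-tail probability of the chi-square distribution that hold only for df = 1 or even df; `chi2_sf` is a hypothetical helper written for this example, and in practice a library routine such as SciPy's `scipy.stats.chi2.sf` would normally be used. The statistic 2.5397 is the value obtained by applying the Step 4 formula to the Step 2 table.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Closed forms only: df == 1, or any positive even df.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df > 0 and df % 2 == 0:
        half = x / 2
        term = 1.0
        total = 1.0
        # exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise NotImplementedError("closed form implemented only for df = 1 or even df")


statistic = 2.5397  # chi-square for the Step 2 table
df = 2              # (2 - 1) * (3 - 1)
alpha = 0.05

p_value = chi2_sf(statistic, df)
reject_null = p_value < alpha

print(f"p-value = {p_value:.3f}, reject H0: {reject_null}")
```

Here the p-value is well above 0.05, so the null hypothesis of independence is not rejected for this table.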
Step 7: Interpret the Results
Based on the comparison in Step 6, you conclude whether there is a significant relationship between the variables.
Example: Chi-Square Goodness of Fit Test - Suppose you have a bag of 100 M&Ms, and the expected color distribution is as follows:
Color | Expected (%) | Observed Count |
Red | 20% | 18 |
Green | 20% | 22 |
Yellow | 20% | 15 |
Orange | 20% | 25 |
Blue | 20% | 20 |
Now, let's perform the Chi-Square Goodness of Fit Test:
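Step 1: Hypotheses - The null hypothesis (H₀) is that each color makes up 20% of the bag. The alternative hypothesis (H₁) is that at least one color's share differs from 20%.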
Step 2: Observed Frequencies - The observed counts are: Red = 18, Green = 22, Yellow = 15, Orange = 25, Blue = 20.
Step 3: Expected Frequencies - Since the total number of M&Ms is 100, the expected count for each color is 20.
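Step 4: Calculate the Chi-Square Statistic - Using the formula:
$$\chi^2 = \frac{(18-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(15-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(20-20)^2}{20} = 0.2 + 0.2 + 1.25 + 1.25 + 0 = 2.9$$
Step 5: Degrees of Freedom - The degrees of freedom is:
$$df = k - 1 = 5 - 1 = 4$$
where k is the number of color categories.
Step 6: Critical Value - For α = 0.05 and df = 4, the critical value from the Chi-Square distribution table is approximately 9.488.
Step 7: Conclusion - Since 2.9 < 9.488, we fail to reject the null hypothesis. There is no significant difference between the observed and expected color distribution of M&Ms.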
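The M&M example can be reproduced with a short script. This is a minimal sketch using only the standard library; the p-value line uses the closed form of the chi-square upper tail that is valid specifically for df = 4, and the critical value 9.488 is taken from Step 6 rather than computed.

```python
import math

# Observed counts from the table above; each color is expected to be 20% of the bag.
observed = {"Red": 18, "Green": 22, "Yellow": 15, "Orange": 25, "Blue": 20}
expected_percent = 20

total = sum(observed.values())
expected = {color: total * expected_percent / 100 for color in observed}

# Chi-square statistic: sum of (O - E)^2 / E over the five colors.
statistic = sum(
    (observed[color] - expected[color]) ** 2 / expected[color] for color in observed
)

df = len(observed) - 1
critical_value = 9.488  # alpha = 0.05, df = 4 (from the distribution table)
reject_null = statistic > critical_value

# Closed-form upper-tail probability, valid only for df = 4.
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3f}, reject H0: {reject_null}")
```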
Now let’s explore the most common mistakes and how to prevent them to ensure accurate analysis.
When performing a chi-squared test, understanding the assumptions and nuances is crucial. Failing to meet these assumptions or misinterpreting the results can lead to misleading or inaccurate conclusions.
Below are key mistakes to avoid and measures to ensure the validity of your Chi-Square results:
1. Ignoring the Assumption of Minimum Expected Frequency
One of the most critical assumptions in the Chi-Square test is that each expected frequency should be at least 5. Violating this assumption can make the test unreliable, especially when expected frequencies are too low.
Best Practices:
2. Using Chi-Square for Small Sample Sizes
The Chi-Square test requires a sufficiently large sample size. When dealing with small datasets or low counts in specific categories, the Chi-Square test can give inaccurate results, especially when the expected frequency in any cell is too low.
Best Practices:
3. Assuming Independence Between Observations
The Chi-Square test assumes that all observations are independent. This means that the occurrence of one event should not affect the occurrence of another. Violating this assumption leads to skewed results.
Best Practices:
4. Miscalculating Degrees of Freedom
The degrees of freedom (df) are essential for correctly interpreting the chi-squared statistic. Incorrectly calculating degrees of freedom can result in comparing the statistic to the wrong critical value, leading to incorrect conclusions.
Best Practices:
5. Incorrectly Interpreting the p-value
A p-value of less than 0.05 indicates a statistically significant relationship between the variables. However, statistical significance does not necessarily imply practical significance or a strong relationship.
Best Practices:
6. Using Incorrect Data Types for the Chi-Square Test
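The chi-squared test is designed for categorical data. Continuous data must first be grouped into categories (binned), and the result can depend on how the bins are chosen. Ordinal data can be tested, but the test ignores the ordering of the categories, so it may miss trends that an order-aware method would detect.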
Best Practices:
The Chi-Square Test helps determine if categorical variables are related or if observed data aligns with an expected distribution. By comparing observed data with expected frequencies, researchers can identify patterns and make data-driven decisions.
References:
https://www.imarcgroup.com/india-data-analytics-market/
https://www.statista.com/topics/4677/analytics-market-in-india/
The Chi-Square Test is primarily used for categorical variables, but it can be applied to ordinal data if treated categorically. However, it does not account for the natural order present in ordinal data, which can limit interpretability. When ordinal properties are relevant, non-parametric tests such as the Mann-Whitney U or Kruskal-Wallis are preferred. These alternatives provide more reliable insights by incorporating rank-based comparisons between ordered groups.
In a chi-squared test, the null hypothesis assumes no association exists between the variables being analyzed. For a test of independence, it means the two categorical variables are unrelated across the population. In a goodness-of-fit test, it assumes the observed data follows the expected theoretical distribution. A low p-value leads to rejecting the null, indicating a statistically significant relationship or mismatch.
The Chi-Square Test of Independence checks whether two categorical variables are statistically related or independent. The Goodness of Fit Test, in contrast, compares observed frequencies to expected values based on a theoretical distribution. Both use the Chi-Square statistic but serve different analysis goals. One examines the association between variables; the other evaluates distributional conformity.
The Chi-Square Test requires categorical data and independently sampled observations across all categories. Each expected frequency in the contingency table should be at least five for the approximation to hold. Random sampling is also necessary to meet the statistical assumptions of the test. If these conditions are satisfied, the Chi-Square Test can yield valid and interpretable results.
Yes, the Chi-Square Test can be extended to analyze more than two categorical variables using higher-dimensional contingency tables. These multi-way tables enable the examination of interactions or dependencies among several categorical factors. As the number of variables increases, so do the degrees of freedom and the complexity of the table. Interpretation becomes more involved but still provides valid statistical conclusions.
The chi-squared distribution serves as the reference for determining the significance of the test statistic. Its shape depends on the degrees of freedom, which increase with the addition of more categories or variables. As the degrees of freedom increase, the distribution becomes more symmetric and approaches a normal distribution. The test statistic is compared to this distribution to compute the p-value.
The Chi-Square Test assumes complete data without any missing entries, which ensures valid expected frequency calculations. If data is missing, results may become biased or invalid due to incorrect cell counts. You can handle missing data by removing affected records or using imputation techniques to replace the missing values. Careful preprocessing is essential to maintain statistical integrity.
The sample size significantly impacts the validity of the Chi-Square Test, particularly through its effect on expected frequencies. A small sample can lead to low expected counts, violating test assumptions. Ideally, each cell should have an expected frequency of at least five. Larger samples ensure the test statistic closely follows the Chi-Square distribution.
A significant Chi-Square statistic indicates that the difference between observed and expected frequencies is larger than chance alone would plausibly produce. It corresponds to a p-value below the chosen significance level. You should reject the null hypothesis in such cases; for a test of independence, this suggests the variables are associated. Significance does not by itself measure how strong the association is. Always verify assumptions before final interpretation of results.
The test accommodates multi-category variables by creating larger contingency tables with multiple rows and columns. Each cell must meet the expected frequency requirement to ensure validity. The test remains reliable as long as assumptions are met, regardless of the number of categories. It is commonly used for analyzing age groups, preferences, or regions.
Cramér’s V measures the strength of association between two categorical variables after performing a chi-squared test. It normalizes the Chi-Square statistic by the sample size and by the smaller of (rows - 1) and (columns - 1). The result ranges from 0 (no association) to 1 (perfect association). It complements the Chi-Square Test by quantifying the strength of the relationship.
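As a rough illustration, Cramér's V can be computed directly from a contingency table using the common definition V = sqrt(χ² / (n × min(rows - 1, columns - 1))). The function name `cramers_v` is hypothetical, and the data is the marketing table from the step-by-step guide.

```python
import math


def cramers_v(observed):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Chi-square statistic with expected counts computed under independence.
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi_square += (o - e) ** 2 / e

    smaller_dim = min(len(row_totals) - 1, len(col_totals) - 1)
    return math.sqrt(chi_square / (n * smaller_dim))


v = cramers_v([[50, 30, 20], [40, 40, 20]])
print(round(v, 3))
```

A value close to 0, as here, points to a weak association, which is consistent with the non-significant test result for this table.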
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...