Chi-Square Test Explained: Easy Guide with Examples

By Rohit Sharma

Updated on Jul 22, 2025

Did you know? The finance and banking sector dominates the data analytics world, contributing 37% of total revenue. Marketing and e-commerce follow with 26% and 15%, respectively, with tools like the Chi-Square Test driving key insights across these industries.

The Chi-Square Test is a statistical method used to assess relationships between categorical variables. It determines whether observed frequencies align with expected frequencies under a null hypothesis. 

This test is widely used in hypothesis testing, data analysis, and pattern recognition. Real-world applications include market research, medical diagnostics, and quality control.

In this blog, you’ll learn about the concept of the Chi-Square Test, its types, assumptions, and formulas. We will also provide a step-by-step example to help you understand how the Chi-Square test works.

To strengthen your understanding of Chi-Square tests and other core data analytics techniques, enroll in upGrad’s Artificial Intelligence & Machine Learning Courses. You'll gain hands-on experience with NLP, deep learning, neural networks, and more.

What is the Chi-Square Test? 2 Key Types Explained

The Chi-Square Test (χ²) is a statistical method for analyzing categorical data, that is, data divided into distinct groups such as gender, product type, or location. It compares the observed results with the results expected if there were no relationship between the variables. The formula for the chi-square test is:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • Oᵢ = Observed frequency in each category
  • Eᵢ = Expected frequency in each category
  • Σ = Summation over all categories
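The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `chi_square_statistic` and the three-category counts are hypothetical and serve only to illustrate the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (invented for illustration).
observed = [30, 50, 20]
expected = [25, 50, 25]

statistic = chi_square_statistic(observed, expected)
print(statistic)  # 25/25 + 0/50 + 25/25 = 2.0
```

A larger value means the observed counts sit further from the expected counts; whether it is large enough to matter depends on the degrees of freedom.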

Now, let’s explore the two main types of Chi-Square tests, which help analyze relationships between variables or check data distribution.

1. Chi-Square Test of Independence

The Chi-Square Test of Independence is used to determine if two categorical variables are independent or related. This test checks if the distribution of one variable is influenced by the other. It's commonly used in surveys and experiments to understand relationships between variables.

Purpose: Checks if two variables are related or independent.

Example:

  • Gender vs. Product Preference:
    • Suppose you want to know whether gender influences product preference.
    • You survey 100 people and record their product choices (A or B) along with their gender (male or female).
    • The Chi-Square Test of Independence can help you determine if gender has any effect on product choice.
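As a rough illustration of the survey above, the sketch below runs the test on a 2x2 table. The counts are invented for this example (the article does not give any), and no continuity correction is applied. The critical value 3.841 is the standard table entry for df = 1 at α = 0.05.

```python
# Hypothetical survey of 100 people (counts invented for illustration).
# Rows: male, female. Columns: product A, product B.
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count for this cell if gender and preference were independent.
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

CRITICAL_VALUE_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
decision = "reject H0" if chi2 > CRITICAL_VALUE_DF1 else "fail to reject H0"
print(f"chi2 = {chi2:.3f} -> {decision}")
```

With these made-up counts every expected cell is 25, the statistic is 4.000, and the null hypothesis of independence would be rejected at the 5% level.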

Want to build full-stack apps with data features like Chi-Square-based A/B testing? Join upGrad’s AI-Powered Full Stack Development Course by IIITB to learn backend, APIs, and data handling for analytics-driven applications in 9 months.

2. Chi-Square Goodness of Fit Test

The Chi-Square Goodness of Fit Test checks whether the observed sample data matches an expected population distribution. It is used when you have one categorical variable and want to determine if the data's distribution fits a specific theoretical pattern.

Purpose: Checks if sample data matches a specific population distribution.

Example:

  • Dice roll outcomes:
    • Imagine you roll a fair six-sided die 60 times. The expected result for each number (1 through 6) is 10 rolls per number.
    • After rolling, you observe the frequency of each number.
    • The Chi-Square Goodness of Fit Test helps determine if the results align with the expected uniform distribution or if the die may be biased.
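The die example can be sketched as follows. The observed counts are hypothetical (any six counts summing to 60 would do), and 11.070 is the standard table critical value for df = 5 at α = 0.05.

```python
# Hypothetical outcome of 60 rolls (counts invented for illustration).
observed = {1: 8, 2: 12, 3: 9, 4: 11, 5: 6, 6: 14}

n_rolls = sum(observed.values())
expected_per_face = n_rolls / len(observed)  # 10 rolls per face for a fair die

chi2 = sum((count - expected_per_face) ** 2 / expected_per_face
           for count in observed.values())

df = len(observed) - 1          # 6 faces -> 5 degrees of freedom
CRITICAL_VALUE_DF5 = 11.070     # chi-square critical value for df = 5, alpha = 0.05

print(f"chi2 = {chi2:.2f}, df = {df}")
print("die looks biased" if chi2 > CRITICAL_VALUE_DF5 else "no evidence of bias")
```

For these counts the statistic is 4.20, well below 11.070, so the rolls are consistent with a fair die.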

Let’s now understand the key conditions and assumptions that must be met for the Chi-Square Test to produce valid and reliable results.

Conditions and Assumptions for the Chi-Square Test

The Chi-Square Test relies on several critical assumptions to ensure its validity. Violating these assumptions can lead to misleading results. These assumptions are rooted in the mathematical properties of categorical data and are essential for accurate p-value computation.

Here’s a detailed overview of each assumption:

1. Independence of Observations

Observations in the dataset must be independent of each other. This means that no observation influences, or is influenced by, any other observation. The test also assumes that data points are randomly sampled.

  • Example: In a study comparing the effectiveness of two treatments, the results of one patient should not influence those of the next patient.

2. Categorical Data

The Chi-Square Test applies only to categorical data. This data is divided into categories, such as “yes/no” or “male/female.” Ordinal data, where categories have a natural order, can also be used, but proper handling is required.

  • Example: A survey question asking "Which product do you prefer?" is categorical (product A, B, C).

3. Expected Frequency Assumption

Each expected frequency must be at least 5 for the Chi-Square approximation to hold. This ensures the test statistic follows the chi-squared distribution closely. If any expected frequency is less than 5, the asymptotic approximation may no longer be valid.

  • Example: A table with expected counts less than 5 could distort the Chi-Square statistic, leading to incorrect conclusions. In such cases, Fisher’s Exact Test may be used for more accurate results.
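One way to check this assumption before running the test is to compute every expected count and flag the ones below 5. The helper name `low_expected_cells` and the small table are hypothetical; this is a sketch, not a library function.

```python
def low_expected_cells(observed, minimum=5):
    """Return (row, column, expected) for each cell whose expected count is below `minimum`."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total
            if expected < minimum:
                flagged.append((i, j, expected))
    return flagged


# Hypothetical small 2x2 table (counts invented for illustration).
table = [
    [12, 3],
    [9, 1],
]

for row, col, expected in low_expected_cells(table):
    print(f"cell ({row}, {col}) has expected count {expected:.1f}")
```

Here both cells in the second column fall below 5, so an exact test such as Fisher's would be the safer choice for this table.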

4. Sample Size

A larger sample size ensures the Chi-Square distribution is approximated correctly. A small sample can distort the distribution of the test statistic. Larger samples provide more reliable p-values.

  • Example: A minimum sample size of about 30 is sometimes cited as a rule of thumb, but the more important check is that the sample is large enough for every expected count to be at least 5.

5. No Cell Frequency Less Than 5

The expected frequency in each contingency table cell should be five or greater. Cells with fewer than 5 expected counts can cause the test to overestimate or underestimate the chi-squared statistic. This assumption is critical for the validity of the results.

  • Example: A category with fewer than five expected observations can skew the results, leading to an unreliable p-value.

6. Adequate Number of Categories

The number of categories in each group should be sufficient to detect meaningful differences. With fewer categories, the test may lose statistical power, making it more difficult to identify relationships between variables.

  • Example: If too many categories are collapsed to satisfy the expected frequency condition, the statistical power of the test can be reduced.

7. Chi-Square Distribution

The Chi-Square statistic approximately follows a Chi-Square distribution with degrees of freedom (df) determined by the number of categories. This distribution is central to hypothesis testing and the interpretation of the p-value. In tables with df = 1 (such as a 2x2 table) and small counts, the approximation can be less accurate, which is why a continuity correction is sometimes applied.

  • Example: In a 2x2 contingency table, df = (2 - 1) x (2 - 1) = 1. The degrees of freedom determine the critical value for the Chi-Square statistic.

Want to apply statistical tests like Chi-Square using Python confidently? Enroll in Learn Python Libraries: NumPy, Matplotlib & Pandas by upGrad. In just 15 hours, you’ll build essential skills in data manipulation, visualization, and analysis.

Let’s now explore the specific steps involved in performing the chi-square test, ensuring you understand the process in detail.

Step-by-Step Process to Perform a Chi-Square Test

Performing a chi-squared test requires a systematic approach to determine if a significant relationship exists between categorical variables. It involves collecting observed data, calculating expected frequencies, and comparing them using the Chi-Square formula to assess the significance.

Here is a step-by-step guide to performing a Chi-Square Test:

Step 1: State the Hypothesis

In any statistical test, clearly defining the hypotheses is essential for drawing a valid conclusion:

  • Null Hypothesis (H₀): Assumes there is no significant relationship or association between the variables. The observed and expected frequencies are consistent under this assumption. Mathematically, this means the variables are independent.
    • Example: In a market research study, H₀ would assume no difference in purchasing behavior between two groups.
  • Alternative Hypothesis (H₁): Assumes there is a significant relationship between the variables. The observed frequencies differ significantly from the expected frequencies, indicating the variables are dependent.
    • Example: In the same market research study, H₁ would assume a relationship exists between the variables (e.g., purchasing behavior differs between two groups).

Step 2: Collect Data and Construct a Contingency Table

The data should be organized into a contingency table (also called a cross-tabulation or crosstab), where each cell represents the frequency of occurrences of specific combinations of variables. This table helps visualize the relationship between two categorical variables.

  • Rows represent one categorical variable (e.g., gender, treatment type).
  • Columns represent the second categorical variable (e.g., disease presence, product preference).

Example: If we are analyzing the effect of a new marketing campaign on customer preferences, the contingency table might look like:

| Groups | Product A | Product B | Product C | Total |
|---|---|---|---|---|
| Group 1 | 50 | 30 | 20 | 100 |
| Group 2 | 40 | 40 | 20 | 100 |
| Total | 90 | 70 | 40 | 200 |
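In practice the table is usually built from raw records rather than typed in by hand. The sketch below assumes the data arrives as (group, product) pairs, which is a hypothetical input format; the pairs are generated so that they reproduce the counts in the table above.

```python
from collections import Counter

# Hypothetical raw records as (group, product) pairs, generated to match the table.
records = (
    [("Group 1", "Product A")] * 50
    + [("Group 1", "Product B")] * 30
    + [("Group 1", "Product C")] * 20
    + [("Group 2", "Product A")] * 40
    + [("Group 2", "Product B")] * 40
    + [("Group 2", "Product C")] * 20
)

counts = Counter(records)                      # frequency of each (group, product) pair
groups = sorted({g for g, _ in records})       # row labels
products = sorted({p for _, p in records})     # column labels

table = [[counts[(g, p)] for p in products] for g in groups]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

print(table)                                   # [[50, 30, 20], [40, 40, 20]]
print(row_totals, col_totals, sum(row_totals))
```

The row totals, column totals, and grand total computed here are the inputs for the expected frequencies in the next step.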

 

Step 3: Calculate the Expected Frequencies

Expected frequencies are the frequencies you would expect if the null hypothesis were true, i.e., if the variables were independent. The formula for calculating the expected frequency (Eᵢ) for each cell is:

$$E_i = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Where:

  • Row Total is the sum of the observed frequencies in a specific row.
  • Column Total is the sum of the observed frequencies in a specific column.
  • Grand Total is the total number of observations in the contingency table.

For example, for a specific cell in the contingency table, the expected frequency would be calculated as:

$$E_{\text{Product A, Group 1}} = \frac{100 \times 90}{200} = 45$$

The expected values for all cells are calculated in a similar manner. These values are used to compare against the observed frequencies.
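Applying the formula to every cell of the Step 2 table can be sketched like this. The function name `expected_frequencies` is a hypothetical helper, not part of any library.

```python
def expected_frequencies(observed):
    """Return the table of expected counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from the Step 2 contingency table (Group 1, Group 2).
observed = [
    [50, 30, 20],
    [40, 40, 20],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)  # each row is [45.0, 35.0, 20.0]
```

Both groups contain 100 people, so their expected rows are identical; with unequal group sizes the rows would differ.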

Step 4: Calculate the Chi-Square Statistic

The Chi-Square statistic is calculated by summing the squared differences between observed and expected frequencies, divided by the expected frequency for each cell. The formula is:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • Oᵢ is the observed frequency for cell i.
  • Eᵢ is the expected frequency for cell i.

The calculation is done for each cell in the contingency table, and the results are summed to obtain the overall chi-squared statistic.

For example, continuing from the earlier table:

$$\chi^2 = \frac{(50 - 45)^2}{45} + \frac{(30 - 35)^2}{35} + \cdots$$

This will give the overall Chi-Square value, which is then used to compare against the critical value from the Chi-Square distribution.
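Written out in full for the six cells of the Step 2 table, with expected counts of 45, 35, and 20 in each row, the sum works out as follows:

```latex
\chi^2 = \frac{(50-45)^2}{45} + \frac{(30-35)^2}{35} + \frac{(20-20)^2}{20}
       + \frac{(40-45)^2}{45} + \frac{(40-35)^2}{35} + \frac{(20-20)^2}{20}
       = \frac{50}{45} + \frac{50}{35}
       \approx 1.111 + 1.429 = 2.540
```

The two Product C cells contribute nothing because their observed and expected counts are equal.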

Step 5: Determine the Degrees of Freedom

The degrees of freedom (df) for a Chi-Square Test is calculated as:

$$df = (r - 1)(c - 1)$$

Where:

  • r is the number of rows in the contingency table.
  • c is the number of columns in the contingency table.

The degrees of freedom indicate how many cell counts are free to vary once the row and column totals are fixed.

For example, in a 2x2 table (two rows and two columns):

$$df = (2 - 1)(2 - 1) = 1$$

This degree of freedom is used to find the critical value from the Chi-Square distribution table.
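A small lookup sketch ties the degrees of freedom to the critical value. The dictionary holds the standard table values for α = 0.05 and df = 1 to 5 only; the names `CRITICAL_VALUES_05` and `degrees_of_freedom` are hypothetical.

```python
# Chi-square critical values at alpha = 0.05, taken from a standard table (df 1-5 only).
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)


df_2x2 = degrees_of_freedom(2, 2)   # the 2x2 example above
df_2x3 = degrees_of_freedom(2, 3)   # the Step 2 table: 2 groups, 3 products

print(df_2x2, CRITICAL_VALUES_05[df_2x2])  # 1 3.841
print(df_2x3, CRITICAL_VALUES_05[df_2x3])  # 2 5.991
```

For larger tables or other significance levels, a full distribution table or a statistics library would be needed.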

Step 6: Compare the Chi-Square Statistic with the Critical Value

Using the Chi-Square distribution table, find the critical value for the chosen significance level (usually α = 0.05) and the calculated degrees of freedom. The critical value is the threshold beyond which you would reject the null hypothesis.

  • If the calculated Chi-Square statistic is greater than the critical value, reject the null hypothesis (H₀).
  • If the calculated Chi-Square statistic is less than or equal to the critical value, fail to reject the null hypothesis (H₀).

This determines whether the observed differences between categories are statistically significant or due to chance.

Step 7: Interpret the Results

Based on the comparison in Step 6, you conclude whether there is a significant relationship between the variables.

  • If the null hypothesis (H₀) is rejected, it suggests a significant relationship between the variables.
  • If the null hypothesis (H₀) is not rejected, the data do not provide sufficient evidence of a relationship; the observed frequencies are consistent with the expected frequencies.
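Statistical software usually reports a p-value rather than a critical-value comparison. As a sketch, the p-value has a simple closed form when the degrees of freedom are even: exp(-x/2) multiplied by the sum of (x/2)^j / j! for j from 0 to df/2 - 1. The function name below is hypothetical, and it deliberately refuses odd df, which need a general-purpose library routine. The statistic is the one for the Step 2 contingency table.

```python
import math


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail p-value of the chi-square distribution, valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = statistic / 2
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * series


# Statistic for the Step 2 table: sum of (O - E)^2 / E over its six cells.
statistic = 2 * (50 - 45) ** 2 / 45 + 2 * (30 - 35) ** 2 / 35
p_value = chi_square_p_value_even_df(statistic, df=2)

print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")  # chi2 = 2.540, p = 0.281
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

A p-value of about 0.28 is well above 0.05, so the table gives no significant evidence that product preference differs between the two groups.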

Example: Chi-Square Goodness of Fit Test - Suppose you have a bag of 100 M&Ms, and the expected color distribution is as follows:

| Color | Expected (%) | Observed Count |
|---|---|---|
| Red | 20% | 18 |
| Green | 20% | 22 |
| Yellow | 20% | 15 |
| Orange | 20% | 25 |
| Blue | 20% | 20 |

Now, let's perform the Chi-Square Goodness of Fit Test:

Step 1: Hypotheses

  • Null Hypothesis (H₀): The color distribution follows the expected distribution.
  • Alternative Hypothesis (H₁): The color distribution does not follow the expected distribution.

Step 2: Observed Frequencies - The observed counts are: Red = 18, Green = 22, Yellow = 15, Orange = 25, Blue = 20.

Step 3: Expected Frequencies - Since the total number of M&Ms is 100, the expected count for each color is 20.

Step 4: Calculate the Chi-Square Statistic - Using the formula:

$$\chi^2 = \frac{(18 - 20)^2}{20} + \frac{(22 - 20)^2}{20} + \frac{(15 - 20)^2}{20} + \frac{(25 - 20)^2}{20} + \frac{(20 - 20)^2}{20} = 0.2 + 0.2 + 1.25 + 1.25 + 0 = 2.9$$

Step 5: Degrees of Freedom - The degrees of freedom are:

$$df = 5 - 1 = 4$$

Step 6: Critical Value - For α = 0.05 and df = 4, the critical value from the Chi-Square distribution table is approximately 9.488.

Step 7: Conclusion - Since 2.9 < 9.488, we fail to reject the null hypothesis. The observed color counts do not differ significantly from the expected distribution of M&Ms.
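The seven steps of this example can be reproduced in a few lines of plain Python. This is a sketch using the counts from the table above; the variable names are arbitrary.

```python
# Observed counts and expected shares from the M&M example.
observed = {"Red": 18, "Green": 22, "Yellow": 15, "Orange": 25, "Blue": 20}
expected_share = 0.20                      # each color is expected to make up 20%

total = sum(observed.values())             # 100 M&Ms
expected = expected_share * total          # 20 per color

# Per-color contribution (O - E)^2 / E, then the overall statistic.
terms = {color: (count - expected) ** 2 / expected
         for color, count in observed.items()}
chi2 = sum(terms.values())

df = len(observed) - 1                     # 5 colors -> 4 degrees of freedom
CRITICAL_VALUE_DF4 = 9.488                 # alpha = 0.05, df = 4

print(terms)
print(f"chi2 = {chi2:.2f}, df = {df}")
print("reject H0" if chi2 > CRITICAL_VALUE_DF4 else "fail to reject H0")
```

Printing the per-color terms makes it easy to see that Yellow and Orange account for most of the statistic.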

Wondering how chi-square helps select key features in NLP tasks? Enroll in upGrad’s Introduction to Natural Language Processing Course. In just 11 hours, you'll learn key concepts like RegExp, spell correction, phonetic hashing, and spam detection.

Now let’s explore the most common mistakes and how to prevent them to ensure accurate analysis.

Common Mistakes to Avoid in Chi-Square Testing

When performing a chi-squared test, understanding the assumptions and nuances is crucial. Failing to meet these assumptions or misinterpreting the results can lead to misleading or inaccurate conclusions.

Below are key mistakes to avoid and measures to ensure the validity of your Chi-Square results:

1. Ignoring the Assumption of Minimum Expected Frequency

One of the most critical assumptions in the Chi-Square test is that each expected frequency should be at least 5. Violating this assumption can make the test unreliable, especially when expected frequencies are too low.

Best Practices:

  • Ensure that the sample size is sufficiently large to accommodate all expected frequencies of at least 5.
  • If any expected frequency is too small, you may need to combine categories or use Fisher’s Exact Test for small sample sizes.

2. Using Chi-Square for Small Sample Sizes

The Chi-Square test requires a sufficiently large sample size. When dealing with small datasets or low counts in specific categories, the Chi-Square test can give inaccurate results, especially when the expected frequency in any cell is too low.

Best Practices:

  • Before conducting the test, verify that the expected frequency for each category is at least 5.
  • For small datasets, consider using Fisher’s Exact Test or other exact tests for contingency tables as alternatives.

3. Assuming Independence Between Observations

The Chi-Square test assumes that all observations are independent. This means that the occurrence of one event should not affect the occurrence of another. Violating this assumption leads to skewed results.

Best Practices:

  • Ensure that the observations in your dataset are independent of one another.
  • If data is dependent (e.g., repeated measures), use alternative tests such as McNemar’s test or Generalized Estimating Equations (GEE).
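For paired yes/no data, McNemar's test uses only the discordant pairs, that is, the subjects whose answer changed. A minimal sketch of its chi-square form, without continuity correction, is shown below; the counts `b` and `c` are hypothetical, and the statistic is compared with the df = 1 critical value of 3.841.

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic (df = 1) from the two discordant-pair counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


# Hypothetical before/after survey of the same respondents:
# b = switched from "yes" to "no", c = switched from "no" to "yes".
b, c = 15, 5

statistic = mcnemar_statistic(b, c)
print(statistic)           # (15 - 5)^2 / (15 + 5) = 5.0
print(statistic > 3.841)   # True: significant change at alpha = 0.05
```

Respondents who gave the same answer both times do not enter the calculation at all, which is what distinguishes this from an ordinary Chi-Square test on the same table.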

4. Miscalculating Degrees of Freedom

The degrees of freedom (df) are essential for correctly interpreting the chi-squared statistic. Incorrectly calculating degrees of freedom can result in comparing the statistic to the wrong critical value, leading to incorrect conclusions.

Best Practices:

  • For a Goodness of Fit Test, df = (number of categories - 1).
  • For a Test of Independence, df = (number of rows - 1) * (number of columns - 1).
  • Double-check your degrees of freedom before looking up the critical value or p-value.

5. Incorrectly Interpreting the p-value

A p-value below the chosen significance level (commonly 0.05) indicates a statistically significant relationship between the variables. However, statistical significance does not necessarily imply practical significance or a strong relationship.

Best Practices:

  • Along with the p-value, report the effect size (e.g., Cramér’s V) to measure the strength of the relationship between variables.
  • Understand that a small p-value does not necessarily imply that the relationship is meaningful in real-world terms.
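Cramér's V is computed as the square root of χ² / (n × min(r - 1, c - 1)), where n is the total sample size and r and c are the numbers of rows and columns. The sketch below applies it to the 2x3 contingency table from Step 2 (n = 200); the function name `cramers_v` is a hypothetical helper.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))."""
    k = min(n_rows - 1, n_cols - 1)
    if n <= 0 or k <= 0:
        raise ValueError("need a positive sample size and at least a 2x2 table")
    return math.sqrt(chi2 / (n * k))


# Chi-square statistic for the Step 2 table (2 groups x 3 products, 200 people).
chi2 = 2 * (50 - 45) ** 2 / 45 + 2 * (30 - 35) ** 2 / 35

v = cramers_v(chi2, n=200, n_rows=2, n_cols=3)
print(f"Cramér's V = {v:.3f}")  # Cramér's V = 0.113
```

A value near 0.11 points to a weak association at most, which agrees with the non-significant test result for that table.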

6. Using Incorrect Data Types for the Chi-Square Test

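The chi-squared test is designed for categorical data. Applying it to continuous data requires binning the values first, which discards information, and applying it to ordinal data ignores the ordering of the categories, which can reduce the test's power.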

Best Practices:

  • Ensure that your data is categorical (e.g., gender, product preference).
  • For continuous data, use tests such as t-tests or ANOVA; for ordinal data, consider the Mann-Whitney U test.

Want to use Chi-Square for feature selection in AI? Check out upGrad’s Advanced Generative AI Certification Course. In just 5 months, you’ll learn to use Copilot to generate Python code, debug errors, analyze data, and create visualizations.

Now, let’s see how upGrad can help you learn more about the Chi-Square Test and other essential concepts in data analytics.

How upGrad Can Help You Stay Ahead in Data Analytics?

The Chi-Square Test helps determine if categorical variables are related or if observed data aligns with an expected distribution. By comparing observed data with expected frequencies, researchers can identify patterns and make data-driven decisions.

To stay competitive, proficiency in tools like Pandas, NumPy, and SciPy is essential for analyzing and performing these tests. upGrad ensures you stay ahead by offering hands-on experience with advanced tools and practical expertise in key technologies.

Not sure which data analytics program best aligns with your career goals? Contact upGrad for personalized counseling and valuable insights, or visit your nearest upGrad offline center for more details.

References:
https://www.imarcgroup.com/india-data-analytics-market/
https://www.statista.com/topics/4677/analytics-market-in-india/

Frequently Asked Questions

1. Can the Chi-Square Test be used with ordinal data?

The Chi-Square Test is primarily used for categorical variables, but it can be applied to ordinal data if treated categorically. However, it does not account for the natural order present in ordinal data, which can limit interpretability. When ordinal properties are relevant, non-parametric tests such as the Mann-Whitney U or Kruskal-Wallis are preferred. These alternatives provide more reliable insights by incorporating rank-based comparisons between ordered groups.

2. What is the null hypothesis in a Chi-Square Test?

In a chi-squared test, the null hypothesis assumes no association exists between the variables being analyzed. For a test of independence, it means the two categorical variables are unrelated across the population. In a goodness-of-fit test, it assumes the observed data follows the expected theoretical distribution. A low p-value leads to rejecting the null, indicating a statistically significant relationship or mismatch.

3. What is the difference between the Chi-Square Test of Independence and the Goodness of Fit Test?

The Chi-Square Test of Independence checks whether two categorical variables are statistically related or independent. The Goodness of Fit Test, in contrast, compares observed frequencies to expected values based on a theoretical distribution. Both use the Chi-Square statistic but serve different analysis goals. One examines the association between variables; the other evaluates distributional conformity.

4. How do you determine whether the Chi-Square Test is appropriate for your data?

The Chi-Square Test requires categorical data and independently sampled observations across all categories. Each expected frequency in the contingency table should be at least five for the approximation to hold. Random sampling is also necessary to meet the statistical assumptions of the test. If these conditions are satisfied, the Chi-Square Test can yield valid and interpretable results.

5. Can a Chi-Square Test be used for multiple categorical variables?

Yes, the Chi-Square Test can be extended to analyze more than two categorical variables using higher-dimensional contingency tables. These multi-way tables enable the examination of interactions or dependencies among several categorical factors. As the number of variables increases, so do the degrees of freedom and the complexity of the table. Interpretation becomes more involved but still provides valid statistical conclusions.

6. What is the role of the Chi-Square distribution in the test?

The chi-squared distribution serves as the reference for determining the significance of the test statistic. Its shape depends on the degrees of freedom, which increase with the number of categories. As the degrees of freedom increase, the distribution becomes more symmetric and approaches a normal distribution. The test statistic is compared to this distribution to compute the p-value.

7. Can the Chi-Square Test handle missing data?

The Chi-Square Test assumes complete data without any missing entries, which ensures valid expected frequency calculations. If data is missing, results may become biased or invalid due to incorrect cell counts. You can handle missing data by removing affected records or using imputation techniques to replace the missing values. Careful preprocessing is essential to maintain statistical integrity.

8. How does the sample size impact the Chi-Square Test?

The sample size significantly impacts the validity of the Chi-Square Test, particularly with respect to expected frequencies. A small sample can lead to low expected counts, violating test assumptions. Ideally, each cell should have an expected frequency of at least five. Larger samples ensure the test statistic closely follows the Chi-Square distribution.

9. What do you do if the Chi-Square Test gives a large value?

A large Chi-Square statistic indicates a large discrepancy between observed and expected frequencies across the categories. This typically results in a low p-value. If the statistic exceeds the critical value for the relevant degrees of freedom, reject the null hypothesis; for a test of independence, this suggests the variables are related. Always verify assumptions before final interpretation of results.

10. How does the Chi-Square Test handle categorical variables with more than two categories?

The test accommodates multi-category variables by creating larger contingency tables with multiple rows and columns. Each cell must meet the expected frequency requirement to ensure validity. The test remains reliable as long as assumptions are met, regardless of the number of categories. It is commonly used for analyzing age groups, preferences, or regions.

11. What is Cramér’s V, and how is it related to the Chi-Square Test?

Cramér’s V measures the strength of association between two categorical variables after performing a chi-squared test. It divides the Chi-Square statistic by the sample size multiplied by the smaller of (rows - 1) and (columns - 1), then takes the square root to produce a normalized value. The result ranges from 0 (no association) to 1 (perfect association). It complements the Chi-Square Test by quantifying the strength of the relationship.

Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
