
What on Earth is Simpson’s Paradox? How Does it Affect Data?

Last updated:
14th Jun, 2023

Simpson’s paradox is a phenomenon in probability and statistics, in which a trend appears in different groups of data, but disappears or reverses when these groups are combined.
You need to be very careful while calculating averages or pooling data from different sectors. It is always better to check whether the pooled data tell the same story or a different one from that of the non-aggregated data. If the story is different, then there is a high probability of Simpson’s paradox. A lurking variable must be affecting the direction of the explanatory and target variables.
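The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`pooled_counts`, `tells_different_story`) and all the counts are hypothetical and made up for this sketch, not taken from any real study.

```python
from typing import Dict, Tuple

# subgroup -> group -> (successes, trials)
Counts = Dict[str, Dict[str, Tuple[int, int]]]


def rate(successes: int, trials: int) -> float:
    """Success rate for one cell."""
    return successes / trials


def pooled_counts(data: Counts, group: str) -> Tuple[int, int]:
    """Sum successes and trials for one group across all subgroups."""
    successes = sum(data[sub][group][0] for sub in data)
    trials = sum(data[sub][group][1] for sub in data)
    return successes, trials


def tells_different_story(data: Counts, a: str, b: str) -> bool:
    """True if every subgroup agrees on which of `a` or `b` has the higher
    rate, but the pooled data points the other way."""
    within = [rate(*data[sub][a]) > rate(*data[sub][b]) for sub in data]
    pooled = rate(*pooled_counts(data, a)) > rate(*pooled_counts(data, b))
    subgroups_agree = all(within) or not any(within)
    return subgroups_agree and within[0] != pooled


# Hypothetical counts: group "x" does better in both regions, but most of
# its trials fall in the low-rate region, so the pooled comparison flips.
toy = {
    "north": {"x": (90, 100), "y": (800, 1000)},
    "south": {"x": (300, 1000), "y": (20, 100)},
}
# Same rates with equal subgroup sizes: pooled and subgroup views agree.
balanced = {
    "north": {"x": (90, 100), "y": (80, 100)},
    "south": {"x": (30, 100), "y": (20, 100)},
}

print(tells_different_story(toy, "x", "y"))       # True
print(tells_different_story(balanced, "x", "y"))  # False
```

The only difference between the two datasets is how the trials are spread across subgroups, which is exactly where a lurking variable does its work.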


Historical Background

Simpson’s Paradox takes its name from Edward H. Simpson, a British statistician who described it in a 1951 paper. The effect itself had been observed earlier: statisticians such as Karl Pearson and Udny Yule noted similar reversals around the turn of the twentieth century.

Simpson’s Paradox refers to a phenomenon in which an apparent trend or relationship in aggregated data reverses or disappears when the data is disaggregated into subgroups. If it is not understood and accounted for, it can lead to incorrect conclusions.

Consider a classic example of Simpson’s Paradox to see the dilemma it presents. Assume a university has two departments, A and B, and the goal is to compare their acceptance rates for male and female candidates. On a surface analysis of the aggregated data, Department A appears to have a higher admittance rate than Department B. However, when we break the data down by gender, Department B has the higher admittance rate for both males and females. This trend reversal at the subgroup level is an example of Simpson’s Paradox.

Real-world Applications

Simpson’s Paradox has far-reaching implications and has been observed in various domains, including social sciences, healthcare, education, economics, and sports. Understanding Simpson’s paradox is crucial in data science for avoiding misinterpretation of data and making accurate decisions.

In the field of healthcare, Simpson’s Paradox has been encountered in studies evaluating the effectiveness of treatments. For instance, a drug may show positive effects overall but fail to demonstrate efficacy when the data is analyzed based on different patient characteristics or disease severity levels. This highlights the importance of considering subgroup analyses to gain a comprehensive understanding of treatment outcomes.

In economics, Simpson’s Paradox can occur when analyzing income inequality across different regions or demographic groups. Aggregated data may suggest a decreasing income gap, but disaggregating the data could reveal that inequality actually worsens within each subgroup. This emphasizes the need to examine data from various perspectives to avoid overlooking underlying patterns.

Preventive Measures

To avoid being misled by Simpson’s Paradox, researchers and analysts should take several preventive steps. First and foremost, perform a subgroup analysis. Examining the data at the subgroup level exposes subtleties in the underlying relationships and helps uncover confounding variables or interaction effects that can contribute to the paradox. Additionally, take sample size into account. Adequate sample sizes within subgroups are essential for reliable and statistically significant results. Small subgroups can produce misleading conclusions and increase the chance of encountering Simpson’s Paradox.

Context is another significant factor to bear in mind. Understanding the setting in which the data was collected can help identify possible biases and confounding factors, which can then be incorporated into the analysis to give a more accurate interpretation of the findings. Lastly, advanced statistical techniques, such as multivariate analysis and causal modeling, can help untangle the real relationships between variables. These techniques make it possible to identify and control for confounding factors, giving a more robust analysis.

By taking these preventive measures, researchers and analysts can reduce the risk of being misled by Simpson’s Paradox and improve the accuracy and reliability of their findings. It is essential to approach data analysis with care and to consider the potential effect of subgroup results, so that decisions rest on an accurate reading of the data.

Let us understand Simpson’s paradox with the help of another example:
In 1973, the University of California, Berkeley, faced allegations of gender bias in its graduate admissions. Here, we will use synthetic data to explain what really happened.

Let’s assume the combined data for admissions in all departments is as follows:

Gender | Applicants | Admitted | Admission Percentage
Men    | 2,691      | 1,392    | 52%
Women  | 1,835      | 789      | 43%

If you observe the data carefully, you’ll see that 52% of the males were given admission, while only 43% of the women were admitted to the university. Clearly, the admissions favoured the men, and the women were not given their due. However, the case is not so simple as it appears from this information alone. Let’s now assume that there are two different categories of departments — ‘Hard’ (hard to get into) and ‘Easy’.


Let’s divide the combined data into these categories and see what happens:

Department | Applied (Men) | Applied (Women) | Admitted (Men) | Admitted (Women) | Admission Percentage (Men) | Admission Percentage (Women)
Hard       | 780           | 1,266           | 200            | 336              | 26%                        | 27%
Easy       | 1,911         | 569             | 1,192          | 453              | 62%                        | 80%

Do you see any gender bias here? In the ‘Easy’ department, 62% of the men and 80% of the women got admission. Likewise, in the ‘Hard’ department, 26% of the men and 27% of the women got admission. Is there any bias here? Yes, there is. But, interestingly, the bias is not in favour of the men; it favours the women!!! If you combine this data, then an altogether different story emerges. A bias favouring the men becomes apparent. In statistics, this phenomenon is known as ‘Simpson’s paradox.’ But why does this paradox occur?
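The reversal can be checked directly from the counts in the two tables. The short Python sketch below uses the article's synthetic numbers (not the real Berkeley figures); the function names are only illustrative.

```python
# Applicant and admission counts copied from the tables above
# (the article's synthetic data, not the real Berkeley figures).
applied = {
    "Hard": {"Men": 780, "Women": 1266},
    "Easy": {"Men": 1911, "Women": 569},
}
admitted = {
    "Hard": {"Men": 200, "Women": 336},
    "Easy": {"Men": 1192, "Women": 453},
}


def admission_rate(dept: str, gender: str) -> float:
    """Admission rate for one gender within one department."""
    return admitted[dept][gender] / applied[dept][gender]


def pooled_rate(gender: str) -> float:
    """Admission rate for one gender with both departments combined."""
    total_admitted = sum(admitted[d][gender] for d in applied)
    total_applied = sum(applied[d][gender] for d in applied)
    return total_admitted / total_applied


for dept in applied:
    print(dept, {g: f"{admission_rate(dept, g):.0%}" for g in ("Men", "Women")})
print("Combined", {g: f"{pooled_rate(g):.0%}" for g in ("Men", "Women")})
# Hard {'Men': '26%', 'Women': '27%'}
# Easy {'Men': '62%', 'Women': '80%'}
# Combined {'Men': '52%', 'Women': '43%'}
```

Women have the higher rate in each department, yet the lower rate once the departments are added together.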


Simpson’s paradox occurs when the effect of the explanatory variable on the target variable changes direction once you account for a lurking variable. In the above example, the lurking variable is the ‘department.’ Men and women applied to the two departments in very different proportions: about 71% of the men (1,911 of 2,691) applied to the ‘Easy’ department, while about 69% of the women (1,266 of 1,835) applied to the ‘Hard’ department, where most applications are rejected regardless of gender. When the data is combined, this difference in where people applied shows up as an apparent bias towards male admissions, which does not exist within either department.
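One way to see the mechanism is to write each gender's overall admission rate as a weighted average of the department rates, where the weights are the shares of that gender's applications going to each department. The sketch below does this with the article's synthetic counts; the helper names are hypothetical.

```python
# (applied, admitted) per gender and department, from the article's synthetic tables.
counts = {
    "Men": {"Hard": (780, 200), "Easy": (1911, 1192)},
    "Women": {"Hard": (1266, 336), "Easy": (569, 453)},
}


def application_shares(gender):
    """Fraction of this gender's applications that went to each department."""
    total = sum(applied for applied, _ in counts[gender].values())
    return {dept: applied / total for dept, (applied, _) in counts[gender].items()}


def overall_rate_from_shares(gender):
    """Overall rate as a share-weighted average of the department rates."""
    shares = application_shares(gender)
    return sum(
        shares[dept] * admitted / applied
        for dept, (applied, admitted) in counts[gender].items()
    )


for gender in counts:
    shares = application_shares(gender)
    print(
        gender,
        f"share applying to Hard: {shares['Hard']:.0%},",
        f"overall rate: {overall_rate_from_shares(gender):.0%}",
    )
# Men share applying to Hard: 29%, overall rate: 52%
# Women share applying to Hard: 69%, overall rate: 43%
```

Because the women's average puts most of its weight on the low-rate ‘Hard’ department, their overall rate ends up lower even though their rate is higher in each department.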

Now suppose you were a statistician for the Indian government inspecting a fighter plane that had returned from a war. Looking at the bullet holes in the aircraft’s surface, what would you recommend? Would you recommend strengthening the areas hit by bullets?

The following is an excerpt from a StackExchange post:

“During World War II, Abraham Wald was a statistician for the U.S. government. He looked at the bombers that returned from missions and analysed the pattern of the bullet ‘wounds’ on the planes. He recommended that the Navy reinforce areas where the planes had no damage.


Why? We have selective effects at work. This sample suggests that damage inflicted on the observed areas could be withstood. Either the plane was never hit in the untouched areas — an unlikely proposition — or strikes to those parts were lethal. We care about the planes that went down, not just those that returned. Those that fell likely suffered an attack in a place that was untouched on those that survived.”

In statistics, things are not as they appear on the surface. You need to be skeptical and look beyond the obvious during analyses. Maybe it’s time to read ‘Think Like a Freak’ or ‘How to Think Like Sherlock’. Let us know if you already have and what your thoughts are on the same!

Profile
Thulasiram is a veteran with 20 years of experience in production planning, supply chain management, quality assurance, Information Technology, and training. Trained in Data Analysis from IIIT Bangalore and UpGrad, he is passionate about education and operations and ardent about applying data analytic techniques to improve operational efficiency and effectiveness. He presently works as Program Associate for Data Analysis at UpGrad.

Frequently Asked Questions (FAQs)

1. What is the impact of Simpson’s paradox on Data Analytics?

Simpson’s Paradox demonstrates the necessity of understanding the data and its limits. As the world moves towards datasets gathered in extremely short spans of time, it reminds us of the importance of critical thinking when dealing with data, and of looking for hidden biases and variables. If the data is not stratified deeply enough, Simpson’s paradox may go unnoticed: heavy aggregation keeps the variance low but can hide the real pattern and introduce bias. If we disaggregate too much, however, each subgroup has too little data to identify the underlying pattern: the bias decreases, but the variance increases. In that sense, Simpson’s Paradox can be seen as a striking illustration of the bias-variance trade-off.

2. What causes Simpson’s Paradox?

It happens when subgroups are unevenly represented across the groups being compared, so that pooling the data mixes the effect of interest with the effect of subgroup membership. This can result from the relationship between the variables or from the way the data has been partitioned into subgroups. A famous example is the admission data for graduate school at UC Berkeley in 1973. When the admission data was looked at overall, it looked like men were more likely to be admitted than women, but when the data was examined individually for each department, the pattern did not hold; if anything, there was a small bias in favour of women.

3. Is it possible to avoid Simpson’s Paradox?

Yes. To avoid erroneous results, it’s usually a good idea to check whether the association in the aggregated dataset holds up in subsets, especially if some groups in the data aren’t equally represented. Another option is to weight the subgroups so that each contributes comparably to the comparison. Statistical analysis tools, however, are just that: tools to assist you in organising and analysing the data you’ve collected. They can’t give you any information about data that wasn’t collected or analysed. As a result, involving a multifunctional team, particularly subject matter experts and practitioners, is critical.

In a well-designed experiment or survey, Simpson’s paradox is unlikely to be an issue. You can identify potential hidden variables ahead of time and control for them by eliminating them, holding them constant across all groups, or including them in the study. Randomization goes a long way toward limiting the effects of a hidden variable that might have been overlooked.
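The weighting option mentioned in this answer can be read in more than one way. One common reading is direct standardization: apply each group's subgroup rates to the same subgroup mix, so that differences in the mix cannot drive the comparison. The sketch below assumes that reading, uses the article's synthetic admissions counts, and takes the mix of all applicants as the common standard; the helper names are hypothetical.

```python
# (applied, admitted) per gender and department, from the article's synthetic tables.
counts = {
    "Men": {"Hard": (780, 200), "Easy": (1911, 1192)},
    "Women": {"Hard": (1266, 336), "Easy": (569, 453)},
}


def standard_mix():
    """Share of all applicants (both genders) in each department."""
    totals = {}
    for by_dept in counts.values():
        for dept, (applied, _) in by_dept.items():
            totals[dept] = totals.get(dept, 0) + applied
    grand_total = sum(totals.values())
    return {dept: n / grand_total for dept, n in totals.items()}


def standardized_rate(gender):
    """Admission rate this gender would have under the common department mix."""
    mix = standard_mix()
    return sum(
        mix[dept] * admitted / applied
        for dept, (applied, admitted) in counts[gender].items()
    )


for gender in counts:
    print(gender, f"{standardized_rate(gender):.0%}")
# Men 46%
# Women 56%
```

With both genders evaluated on the same department mix, the comparison agrees with the department-level tables rather than with the pooled figures. The choice of standard mix is a judgment call, and a different one would give different adjusted rates.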
