
Data Manipulation: How Can You Spot Data Lies?

Last updated:
24th Oct, 2017

A Google search for ‘average data scientist salary in India’ will return a happy result. 

Does this mean that anyone who wants to enter this exotic field can expect this salary? Why not? What’s wrong with expecting to earn a sum claimed by a reputable website? After all, the website may have conducted extensive research to arrive at this number. Yet making a decision based on this claim alone is not a good idea. But why? Read on!

What does “average” mean in the above Google search? Averages come in three flavours: mean, median, and mode. Which average does this “national average” refer to? If it is the mean, what can you infer from it? Check a result from another website.

Here it says, “Experience strongly influences income for this job”.

Why is this important?

A person with rich experience may be drawing a better income than someone without any. An individual who graduated from a reputed institute could be earning more than someone who is self-taught. There is a fair chance that a person could inflate their salary in a survey to boost their status, or downplay it for other reasons such as taxes. In such scenarios, the mean is not an appropriate summary.

If you calculate the mean of such salaries, a few outliers will have an undue effect on the average obtained. A handful of very high salaries will pull the mean up. In such cases, the median is the more representative figure: half of the people earn less than it and half earn more, no matter how extreme the top salaries are.
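To see how strongly a single outlier can move the mean, here is a minimal Python sketch. The salary figures are invented purely for illustration and do not come from any survey.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [30, 35, 40, 45, 50, 55, 400]

mean_salary = mean(salaries)      # pulled up by the single 400
median_salary = median(salaries)  # middle value, unaffected by the outlier

print(f"mean   = {mean_salary:.1f}")  # mean   = 93.6
print(f"median = {median_salary}")    # median = 45
```

In this made-up sample, the mean is higher than what six of the seven people actually earn, while the median sits in the middle of the group.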

In the future, if you come across the word ‘average’ anywhere, look for amplifying information. Check if the author is referring to the mean, median, or mode. Check for confidence intervals and levels of significance. If these are not found, then there is reason enough to be skeptical.

 

Say, an endorsement specifies the type of average. Can you then take it to be absolute? No? Why not?

Let’s go back to the original statement about the average salary of data scientists. The statement claims to be from a sample of 303 salaries. Exactly one day ago, this number was 12. Is this a sample you can trust?

To conduct a survey or an experiment, the sample needs to be a true representative of the underlying population. The size of the sample must be large enough to confidently draw inferences about the population.

I was watching some lectures by Professor Starbird about statistics. I learned that years ago, a newspaper conducted a survey regarding the presidential elections in the US. This newspaper sent out a questionnaire, analysed the responses, and published the result that a particular candidate was going to win. The election result was the opposite of what the paper predicted: its chosen candidate lost by a wide margin. Subsequently, the newspaper analysed where it went wrong.

The paper’s management found that it only sent the questionnaire to its affluent subscribers. Evidently, they did not represent the whole population. As a consequence, the prediction based on this biased sample became a source of embarrassment for the newspaper.
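The effect of such a biased sample can be sketched with a few lines of Python. The group sizes and levels of support below are hypothetical numbers chosen only to illustrate the mechanism; they are not the figures from the actual poll.

```python
# Hypothetical electorate: (group, number of voters, voters supporting candidate A)
electorate = [
    ("affluent subscribers", 200, 160),
    ("everyone else", 800, 280),
]

total_voters = sum(n for _, n, _ in electorate)
total_support = sum(s for _, _, s in electorate)
true_share = total_support / total_voters  # support across the whole population

# A survey that reaches only the first group sees a very different picture.
_, n_affluent, s_affluent = electorate[0]
biased_share = s_affluent / n_affluent

print(f"share in full population: {true_share:.0%}")    # 44%
print(f"share among subscribers:  {biased_share:.0%}")  # 80%
```

Under these assumed numbers, the subscribers-only survey predicts a comfortable win for a candidate who would actually lose.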

You can infer whatever results you’d like to see by taking a very small sample! As a very basic example, if you toss a coin 10 times, do you always get heads five times and tails five times? You could get seven heads, and maybe this is the result you desire. The ‘law of averages’ (more precisely, the law of large numbers) only shows up when the coin-tossing experiment is repeated a large number of times: the proportion of heads then settles close to one half. In the short term, almost any result is possible.
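A small simulation makes the coin-tossing point concrete. This is a rough sketch using Python's standard `random` module; the helper name `heads_fraction` and the chosen sample sizes are my own, and the simulated fractions will vary with the seed.

```python
import random
from math import comb

def heads_fraction(n_tosses, rng):
    """Toss a fair coin n_tosses times and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    # Small n can land far from 0.5; large n stays close to it.
    print(n, heads_fraction(n, rng))

# Exact chance of a lopsided result (7 or more heads) in just 10 fair tosses.
p_at_least_7 = sum(comb(10, k) for k in range(7, 11)) / 2**10
print(p_at_least_7)  # 0.171875
```

Roughly one run in six of a 10-toss experiment shows seven or more heads, so a small sample can easily "prove" that a fair coin is biased.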

If you don’t see information about the sample size along with the type of average, this is a cause for concern. If the sample size is sufficient and is a true representative of the population, then there is no need to hide it.

A report claimed that in a particular college 33% of the female students married male professors.

We need to be very careful with percentages. If percentages are not accompanied by the actual numbers, they may be misleading. In the college mentioned above, it turned out that only three women studied there, and just one of them married a professor. One out of three makes 33%. Always check if percentages are accompanied by the actual numbers. If they are not, then there is a cause for concern.
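One simple habit is to always print the raw counts next to the percentage. The helper below is an illustrative sketch; the name `describe_share` is hypothetical, and the second call uses invented numbers for contrast.

```python
def describe_share(count, total):
    """Report a share as a percentage together with the raw counts behind it."""
    pct = 100 * count / total
    return f"{pct:.0f}% ({count} of {total})"

print(describe_share(1, 3))        # 33% (1 of 3)
print(describe_share(1000, 3000))  # 33% (1000 of 3000)
```

Both lines report the same percentage, but only the counts reveal that the first one rests on three people.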

Another major fallacy in statistics is confusing correlation with causation. The fact that two things are correlated does not, by itself, mean that one causes the other.

In a group of aboriginal people, the presence of lice on the body was considered a sign of good health. When a person in that tribe ran a fever, it was observed that there were no lice on his/her body. So, the tribe naively assumed that this lack of lice was, in fact, the cause of the fever. Later it was found that when a person suffered from fever, the increased body temperature became uncomfortable for the lice. The fever was causing the lice to abandon their host; their absence was not the cause of the fever, as assumed.


Say ‘A’ and ‘B’ are correlated. There could be some other variable ‘C’ that causes ‘A’ and ‘B’ to rise and fall together. ‘A’ could be the cause and ‘B’ the effect, it could be the other way around, or it could be just a coincidence. The point is that the correlation alone cannot tell you which; a controlled experiment is usually needed to settle it. Correlation should never be confused with causation.
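The hidden-variable case can be simulated in a few lines. In this sketch, which uses synthetic data and variable names of my own choosing, ‘A’ and ‘B’ never influence each other; both are simply driven by a common ‘C’ plus independent noise.

```python
import random
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]      # hidden common cause
a = [ci + rng.gauss(0, 0.5) for ci in c]     # A depends only on C
b = [ci + rng.gauss(0, 0.5) for ci in c]     # B depends only on C

corr_ab = pearson(a, b)

# Remove C's contribution: what is left of A and B is unrelated noise.
a_rest = [ai - ci for ai, ci in zip(a, c)]
b_rest = [bi - ci for bi, ci in zip(b, c)]
corr_rest = pearson(a_rest, b_rest)

print(f"corr(A, B)             = {corr_ab:.2f}")
print(f"corr(A, B) without C   = {corr_rest:.2f}")
```

With this setup the first correlation comes out strong (about 0.8 in theory) and the second close to zero, even though neither variable has any effect on the other.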

Similarly, graphs can be manipulated to look impressive without misquoting the data.

These are only a few of the ways in which statistics can be used to lie. This list is only suggestive, not exhaustive. All these methods of bluffing go to show that statistics is as much an art as it is a science.

Data is the new oil. Most decisions in the private and public sectors are based on data and its analysis. Wrong interpretations of data or derivations of incorrect insights will have costly ramifications.


In the world of viral marketing, you need to be extra careful about the claims of advertisers. Here too, you need to be aware of the existence of statistics as an art. A little skepticism about the claims of advertisers, combined with a knowledge of how people deploy statistics to tell lies, will inevitably help you make better and more conscious decisions.


(This article is inspired by the book How to Lie with Statistics by Darrell Huff).

 

Profile
Thulasiram is a veteran with 20 years of experience in production planning, supply chain management, quality assurance, information technology, and training. Trained in Data Analysis at IIIT Bangalore and UpGrad, he is passionate about education and operations and ardent about applying data analytic techniques to improve operational efficiency and effectiveness. He presently works as Program Associate for Data Analysis at UpGrad.

Frequently Asked Questions (FAQs)

1. What does misleading mean in statistics?

Misuse of statistics can be intentional or unintentional. A deliberate effort to blur the picture with false information will almost certainly intensify bias, but it does not take a malevolent goal to generate confusion. Misuse of statistics is a widespread problem that affects a wide range of enterprises and academic fields. Common blunders that lead to misuse include faulty polling, flawed correlation, data fishing, misleading data visualization, purposeful bias, bad sampling, selective data display, omitting the baseline, Simpson’s paradox, and misleading graphs.

2. How does the use of misleading data affect the business?

Today’s successful business organisations rely on data to make well-informed decisions that provide high-value outcomes. Data can aid in resolving issues, monitoring performance, improving processes, and gaining a better understanding of the market. Poor data quality, on the other hand, can be detrimental to your business. The consequences of acting on misinterpreted data include wrong business strategies, increased financial costs, lost productivity, a damaged reputation, and missed opportunities.

3. What is the main purpose of Data manipulation?

In the technical sense, data manipulation means sorting, rearranging, and moving data without altering the underlying values. It entails transforming data into the format required for display or for feeding and training an analytics model. Its main goal is to change the relationship between data items (logical or physical), not the data itself. Row and column filtering, aggregation, joins and concatenation, string manipulation, categorization, regression, and mathematical formulas are some of the most common operations used to manage data.
