
Data Manipulation: How Can You Spot Data Lies?

Last updated:
24th Oct, 2017

A Google search for ‘average data scientist salary in India’ will return a happy result. 

Does this mean that any person who wants to enter this exotic field can expect this salary? Why not? What’s wrong with expecting to earn a sum claimed by a reputed website? After all, this website may have conducted some extensive research to arrive at this number. Yet, taking a decision based on this claim alone is not a good idea. But why? Read on!

What does “average” mean in the above Google search? Averages come in different flavours. These are mean, median, and mode. Which average does this “national average” refer to? If it is the mean, what can you infer from it? Check a result from another website.

Here it says, “Experience strongly influences income for this job”.

Why is this important?

A person with rich experience may be drawing a better income than someone without any. An individual who graduated from a reputed institute could be earning more than someone who is self-taught. There is a fair chance that some respondents inflate their salaries in a survey to boost their status, while others downplay theirs for reasons such as taxes. With such a wide and uneven spread of values, the mean alone isn’t an appropriate summary.

If you calculate the mean of such salaries, a few outliers will have an undue effect on the average obtained. A handful of very high salaries will pull the mean up. In such cases, the median is the more representative figure: an equal number of people earn sums below and above it.
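To make the effect of an outlier concrete, here is a minimal sketch. The salary figures are invented for illustration (they are not from any survey), and the variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical annual salaries in lakh rupees (illustrative, not survey data).
salaries = [4, 5, 5, 6, 6, 7, 8, 9, 10, 60]

mean_salary = mean(salaries)      # pulled up by the single 60-lakh outlier
median_salary = median(salaries)  # middle value; barely affected by the outlier

print(f"mean   = {mean_salary}")
print(f"median = {median_salary}")

# How many people earn less than each "average"?
below_mean = sum(s < mean_salary for s in salaries)
below_median = sum(s < median_salary for s in salaries)
print(f"{below_mean} of {len(salaries)} earn less than the mean")
print(f"{below_median} of {len(salaries)} earn less than the median")
```

In this made-up group, nine of the ten people earn less than the mean of 12, while the median of 6.5 splits the group evenly.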

In the future, if you come across the word ‘average’ anywhere, look for amplifying information. Check if the author is referring to the mean, median, or mode. Check for confidence intervals and levels of significance. If these are not found, then there is reason enough to be skeptical.

 

Suppose a claim does specify the type of average. Can you then take it at face value? No? Why not?

Let’s go back to the original statement about the average salary of data scientists. The statement claims to be from a sample of 303 salaries. Exactly one day ago, this number was 12. Is this a sample you can trust?

To conduct a survey or an experiment, the sample needs to be a true representative of the underlying population. The size of the sample must be large enough to confidently draw inferences about the population.

I was watching some lectures by Professor Starbird about statistics. I learned that years ago, a newspaper conducted a survey regarding the presidential elections in the US. This newspaper sent out a questionnaire, analysed it, and published the result that a particular candidate was going to win. After the election, the result was the opposite of what the paper predicted. The candidate predicted by the newspaper lost by a high margin. Subsequently, the newspaper analysed where it went wrong.

The paper’s management found that it only sent the questionnaire to its affluent subscribers. Evidently, they did not represent the whole population. As a consequence, the prediction based on this biased sample became a source of embarrassment for the newspaper.

You can infer whatever result you’d like to see by taking a very small sample! As a very basic example, if you toss a coin 10 times, do you always get heads five times and tails five times? You could get seven heads in a row, and maybe this is the result you desire. The ‘law of averages’ (more precisely, the law of large numbers) only shows itself, with the share of heads settling close to one half, when the coin-tossing experiment is performed a large number of times. In the short term, almost any result is possible.
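The coin-tossing point can be checked with exact binomial arithmetic rather than taken on faith. The sketch below assumes a fair coin; the helper name `prob_heads_between` and the 45% to 55% band are choices made for this illustration.

```python
from math import comb

def prob_heads_between(n, lo, hi):
    """Exact probability that a fair coin tossed n times shows
    between lo and hi heads (inclusive)."""
    favourable = sum(comb(n, k) for k in range(lo, hi + 1))
    return favourable / 2 ** n

# Chance of a perfect 5/5 split in 10 tosses.
p_even_split = prob_heads_between(10, 5, 5)

# Chance of a lopsided result: 7 or more heads in 10 tosses.
p_seven_plus = prob_heads_between(10, 7, 10)

# Chance that the share of heads lands within 45%..55% in 1,000 tosses.
p_near_half_1000 = prob_heads_between(1000, 450, 550)

print(f"exactly 5 heads in 10 tosses:  {p_even_split:.3f}")
print(f"7 or more heads in 10 tosses:  {p_seven_plus:.3f}")
print(f"45%..55% heads in 1000 tosses: {p_near_half_1000:.4f}")
```

A perfect split happens in fewer than one in four runs of ten tosses, and a lopsided seven-or-more result is not rare at all. With 1,000 tosses, the share of heads almost always stays close to one half.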

If you don’t see information about the sample size along with the type of average, this is a cause for concern. If the sample size is sufficient and is a true representative of the population, then there is no need to hide it.
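As a rough sketch of why the sample size matters, the snippet below compares the approximate 95% margin of error of a sample mean for the two sample sizes mentioned earlier, 12 and 303. The standard deviation of 5 lakh is an assumption made purely for illustration, and the formula uses the normal approximation; for a sample as small as 12, a t-based multiplier would be somewhat larger.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Approximate 95% margin of error for a sample mean
    (normal approximation; assumes a random, unbiased sample)."""
    return z * std_dev / math.sqrt(n)

ASSUMED_STD_DEV = 5.0  # hypothetical spread of salaries, in lakh rupees

for n in (12, 303):
    moe = margin_of_error(ASSUMED_STD_DEV, n)
    print(f"n = {n:>3}: mean known to within about ±{moe:.2f} lakh")
```

The margin shrinks with the square root of the sample size, so the larger sample is about five times tighter. No sample size fixes a biased sample, though: the newspaper poll above failed because of who was asked, not how many.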

A report claimed that in a particular college 33% of the female students married male professors.

We need to be very careful with percentages. If percentages are not accompanied by the actual numbers, they may be misleading. In the college mentioned above, it turned out that only three women studied there, and just one of them married a professor. One out of three makes 33%. Always check if percentages are accompanied by the actual numbers. If they are not, then there is a cause for concern.
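One simple habit is to never report a percentage without the counts behind it. A minimal sketch follows; the function name `describe_share` is hypothetical, and the second call uses invented numbers for contrast.

```python
def describe_share(count, total):
    """Report a percentage together with the counts it was computed from."""
    pct = 100 * count / total
    return f"{pct:.0f}% ({count} of {total})"

# The college example: one of three women students.
print(describe_share(1, 3))

# The same percentage from a much larger (invented) base.
print(describe_share(1000, 3000))
```

Both lines show 33%, but only the second rests on enough cases to mean anything.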

Another major fallacy in statistics is confusing correlation with causation. If two items are correlated, it does not follow that one causes the other.

In a group of aboriginal people, the presence of lice on the body was considered a sign of good health. Whenever a person in that tribe ran a fever, it was observed that there were no lice on his/her body. So, the tribe naively assumed that this lack of lice was, in fact, the cause of the fever. Later it was found that when a person suffered from fever, the increased body temperature became uncomfortable for the lice. The fever was causing the lice to abandon their host; their absence was not the cause of the fever, as assumed.


Say, ‘A’ and ‘B’ are correlated. There could be some other variable ‘C’ that causes ‘A’ and ‘B’ to rise and fall together. ‘A’ could be the cause, and ‘B’ could be the effect, or it could be the other way around or just a coincidence. The point is, there is no way to tell without carrying out controlled experiments. Correlation should never be confused with causation.
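The hidden-variable case can be simulated. In the sketch below, ‘A’ and ‘B’ never influence each other; both are built from a shared variable ‘C’ plus independent noise. The noise levels, sample size, and random seed are arbitrary assumptions for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# C is the hidden common cause; A and B each depend on C only.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

corr_ab = pearson(a, b)

# Remove C's contribution from both: what is left is independent noise.
a_without_c = [ai - ci for ai, ci in zip(a, c)]
b_without_c = [bi - ci for bi, ci in zip(b, c)]
corr_residual = pearson(a_without_c, b_without_c)

print(f"correlation of A and B:            {corr_ab:.2f}")
print(f"correlation once C is accounted for: {corr_residual:.2f}")
```

‘A’ and ‘B’ come out strongly correlated even though neither causes the other, and the correlation all but disappears once ‘C’ is accounted for. In real data ‘C’ is usually unknown, which is why controlled experiments are needed.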

Similarly, graphs can be manipulated to look impressive without misquoting the data.
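One common way to do this is to start the vertical axis of a bar chart somewhere above zero. That particular trick is an illustration chosen here, not one the article spells out, and the values below are invented. The sketch computes how much taller the larger bar looks under each choice of axis start.

```python
def bar_height_ratio(smaller, larger, axis_start=0.0):
    """Ratio of drawn bar heights when the vertical axis begins at axis_start."""
    return (larger - axis_start) / (smaller - axis_start)

# Two hypothetical values that differ by 4%.
honest = bar_height_ratio(50, 52)                    # axis starts at 0
truncated = bar_height_ratio(50, 52, axis_start=49)  # axis starts at 49

print(f"axis from 0:  larger bar is {honest:.2f}x as tall")
print(f"axis from 49: larger bar is {truncated:.2f}x as tall")
```

The data is quoted correctly in both charts, yet the truncated axis makes a 4% difference look like a threefold one.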

These are only a few of the ways in which statistics can be used to lie. This list is only illustrative, not exhaustive. All these methods of bluffing go to show that statistics is as much an art as it is a science.

Data is the new oil. Most decisions in the private and public sectors are based on data and its analysis. Wrong interpretations of data or derivations of incorrect insights will have costly ramifications.


In the world of viral marketing, you need to be extra careful about the claims of advertisers. Here too, you need to be aware of the existence of statistics as an art. A little skepticism about the claims of advertisers, combined with a knowledge of how people deploy statistics to tell lies, will inevitably help you make better and more conscious decisions.


(This article is inspired by the book How to Lie with Statistics by Darrell Huff).

 

Profile
Thulasiram is a veteran with 20 years of experience in production planning, supply chain management, quality assurance, Information Technology, and training. Trained in Data Analysis from IIIT Bangalore and UpGrad, he is passionate about education and operations and ardent about applying data analytic techniques to improve operational efficiency and effectiveness. He presently works as Program Associate for Data Analysis at UpGrad.

Frequently Asked Questions (FAQs)

1. What does misleading mean in statistics?

Statistics can be misused unintentionally or intentionally. A deliberate effort to blur the lines with false information will almost certainly intensify bias, but no malevolent goal is needed to generate confusion. Misuse of statistics is a widespread problem that now affects a wide range of enterprises and academic sectors. Common blunders that lead to misuse include faulty polling, flawed correlation, data fishing, misleading data visualization and graphs, purposeful bias, bad sampling, selective data display, omitting the baseline, and Simpson’s paradox.

2. How does the use of misleading data affect the business?

Today’s successful business organisations rely on data to make well-informed decisions that provide high-value outcomes. Data can aid in the resolution of issues, the monitoring of performance, the improvement of processes, and the acquisition of a better understanding of the market. Poor data quality, on the other hand, might be detrimental to your business. The consequences of using misinterpreted data are wrong business strategies, increased financial costs, loss in productivity, damaged reputation, and missed opportunities.

3. What is the main purpose of Data manipulation?

Data manipulation is about sorting, rearranging, and relocating data without altering its content. It entails transforming data into the format required for display or for feeding and training an analytics model. Its main goal is to change the relationship between data items (logical or physical), not the data itself. Row and column filtering, aggregation, joins and concatenation, string manipulation, categorization, regression, and mathematical formulas are some of the most common operations used to manage data.
