Data Manipulation: How Can You Spot Data Lies?

Last updated: 24th Oct, 2017

A Google search for ‘average data scientist salary in India’ will return a happy result. 

Does this mean that anyone who wants to enter this exotic field can expect this salary? Why not? What’s wrong with expecting to earn a sum claimed by a reputed website? After all, the website may have conducted extensive research to arrive at this number. Yet making a decision based on this claim alone is not a good idea. But why? Read on!

What does “average” mean in the above Google search? Averages come in different flavours: mean, median, and mode. Which of these does this “national average” refer to? If it is the mean, what can you infer from it? Check a result from another website.

Here it says, “Experience strongly influences income for this job”.

Why is this important?

A person with extensive experience may be drawing a better income than someone with none. An individual who graduated from a reputed institute could be earning more than someone who is self-taught. There is a fair chance that a respondent inflates his/her salary in a survey to boost his/her status, or downplays it for other reasons such as taxes. With this much spread and distortion in the data, the mean isn’t an appropriate summary.

If you calculate the mean of such salaries, a few outliers will have an undue effect on the result. A handful of very high salaries will pull the mean up. In such cases, the median is the more representative figure: half the people earn less than it and half earn more.
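To make the effect of outliers concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are invented purely for illustration (they are not taken from any survey), and the unit of lakh rupees per year is an assumption.

```python
from statistics import mean, median

# Hypothetical annual salaries in lakh rupees; the values are made up
# for illustration and include two very high earners as outliers.
salaries = [6, 7, 7, 8, 8, 9, 10, 60, 95]

salary_mean = mean(salaries)      # dragged upward by the two outliers
salary_median = median(salaries)  # middle value of the sorted list

print(f"mean   = {salary_mean:.1f}")
print(f"median = {salary_median}")
```

In this made-up list, seven of the nine people earn 10 or less, yet the mean lands well above that because of the two large values. The median stays with the bulk of the data.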

In the future, if you come across the word ‘average’ anywhere, look for clarifying information. Check if the author is referring to the mean, median, or mode. Check for confidence intervals and levels of significance. If these are not found, then there is reason enough to be skeptical.

 

Say a claim specifies the type of average. Can you then take it at face value? No? Why not?

Let’s go back to the original statement about the average salary of data scientists. The statement claims to be from a sample of 303 salaries. Exactly one day ago, this number was 12. Is this a sample you can trust?

To conduct a survey or an experiment, the sample needs to be representative of the underlying population, and it must be large enough to draw inferences about that population with confidence.

I was watching some statistics lectures by Professor Starbird. I learned that years ago, a newspaper conducted a survey about the presidential election in the US. The newspaper sent out a questionnaire, analysed the responses, and published the result that a particular candidate was going to win. The election result was the opposite of what the paper had predicted: its candidate lost by a wide margin. Subsequently, the newspaper analysed where it had gone wrong.

The paper’s management found that it had sent the questionnaire only to its affluent subscribers. Evidently, they did not represent the whole population. As a consequence, the prediction based on this biased sample became a source of embarrassment for the newspaper.
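The newspaper's mistake can be imitated with a small simulation. This is only a sketch under invented assumptions: the 20% share of affluent voters and the 70% and 40% support rates are hypothetical numbers chosen to show the mechanism, not figures from the actual poll.

```python
import random

rng = random.Random(7)  # fixed seed so reruns give the same numbers

# Hypothetical electorate of (is_affluent, supports_candidate_x) pairs.
# Assumed for illustration: 20% of voters are affluent; 70% of affluent
# voters and 40% of everyone else support candidate X.
population = []
for _ in range(100_000):
    affluent = rng.random() < 0.20
    p_support = 0.70 if affluent else 0.40
    population.append((affluent, rng.random() < p_support))

def support_share(voters):
    """Fraction of the given voters who support candidate X."""
    return sum(supports for _, supports in voters) / len(voters)

# Biased poll: questionnaires go only to affluent voters.
affluent_only = [voter for voter in population if voter[0]]
biased_sample = rng.sample(affluent_only, 1_000)

# Fair poll: every voter is equally likely to be asked.
random_sample = rng.sample(population, 1_000)

print(f"whole population : {support_share(population):.2f}")
print(f"affluent-only    : {support_share(biased_sample):.2f}")
print(f"random sample    : {support_share(random_sample):.2f}")
```

Both polls ask the same number of people. Only the way they were chosen differs, and that alone is enough to move the affluent-only estimate far from the population figure.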

You can infer whatever result you’d like to see by taking a very small sample! As a very basic example, if you toss a coin 10 times, do you always get heads five times and tails five times? You could get seven heads in a row, and maybe this is the result you desire. The ‘law of averages’ (more precisely, the law of large numbers) only brings the share of heads close to one half when the coin-tossing experiment is performed a large number of times. In the short term, any result is possible.
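The coin-tossing argument is easy to try out. The sketch below simulates a fair coin with Python's standard `random` module. The helper name `heads_fraction` and the chosen sample sizes are illustrative, not part of the original article.

```python
import random

def heads_fraction(n_tosses, rng):
    """Return the fraction of heads in n_tosses simulated fair coin flips."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(42)  # fixed seed so reruns give the same numbers
for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} tosses: fraction of heads = {heads_fraction(n, rng):.4f}")
```

With only 10 tosses the fraction can easily sit far from 0.5. As the number of tosses grows, it tends to settle close to 0.5, which is why a tiny sample can be made to support almost any claim.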

If you don’t see information about the sample size along with the type of average, this is a cause for concern. If the sample size is sufficient and is a true representative of the population, then there is no need to hide it.

A report claimed that in a particular college 33% of the female students married professors.

We need to be very careful with percentages. If percentages are not accompanied by the actual numbers, they may be misleading. In the college mentioned above, it turned out that only three women studied there, and just one of them married a professor. One out of three makes 33%. Always check whether percentages are accompanied by the actual numbers. If they are not, there is cause for concern.
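One simple habit is to always print the raw counts next to the percentage. The helper below, `describe_share`, is a hypothetical name used only for this sketch. The second call uses an invented count of 330 out of 1000 for contrast.

```python
def describe_share(part, total):
    """Format a share as a percentage together with the raw counts behind it."""
    pct = 100 * part / total
    return f"{pct:.0f}% ({part} of {total})"

# The college example: one of three women.
print(describe_share(1, 3))

# An invented larger group with the same headline percentage.
print(describe_share(330, 1000))
```

Both lines report 33%, but the counts make it obvious that the first figure rests on three people and the second on a thousand.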

Another major fallacy in statistics is confusing correlation with causation. If two items are correlated, it does not follow that one causes the other.

In a group of aboriginal people, the presence of lice on the body was considered healthy. Whenever a person in that tribe ran a fever, it was observed that there were no lice on his/her body. So, the tribe naively assumed that the lack of lice was the cause of the fever. Later it was found that when a person suffered from fever, the increased body temperature became uncomfortable for the lice. The fever was causing the lice to abandon their host; their absence was not the cause of the fever, as assumed.


Say ‘A’ and ‘B’ are correlated. There could be some other variable ‘C’ that causes ‘A’ and ‘B’ to rise and fall together. ‘A’ could be the cause and ‘B’ the effect, it could be the other way around, or the correlation could be just a coincidence. The point is that the data alone cannot tell you which; that usually takes a controlled experiment. Correlation should never be confused with causation.
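The hidden-variable case can be sketched with a short simulation. Everything here is an assumption made for illustration: ‘C’ is drawn from a standard normal distribution, and ‘A’ and ‘B’ are each ‘C’ plus independent noise, so neither one influences the other. The `pearson` helper is written out by hand to keep the example self-contained.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so reruns give the same numbers

# C is the hidden common cause. A and B each depend on C plus their own
# independent noise; A never enters B's formula, and B never enters A's.
c = [rng.gauss(0, 1) for _ in range(5_000)]
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

# What is left of A and B once the effect of C is subtracted.
a_rest = [ai - ci for ai, ci in zip(a, c)]
b_rest = [bi - ci for bi, ci in zip(b, c)]

print(f"corr(A, B)                = {pearson(a, b):.2f}")
print(f"corr(A, B) with C removed = {pearson(a_rest, b_rest):.2f}")
```

‘A’ and ‘B’ come out strongly correlated even though neither affects the other, and the correlation largely disappears once ‘C’ is accounted for. In real data ‘C’ is usually not observed, which is why the correlation by itself cannot settle the question.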

Similarly, graphs can be manipulated to look impressive without misquoting the data, for example by truncating the vertical axis so that a small change looks dramatic.

These are only a few of the ways in which statistics can be used to lie. This list is only suggestive, not exhaustive. All these methods of bluffing go to show that statistics is as much an art as it is a science.

Data is the new oil. Most decisions in the private and public sectors are based on data and its analysis. Wrong interpretations of data or derivations of incorrect insights will have costly ramifications.


In the world of viral marketing, you need to be extra careful about the claims of advertisers. Here too, you need to be aware that statistics can be practised as an art. A little skepticism about such claims, combined with a knowledge of how people deploy statistics to tell lies, will help you make better and more conscious decisions.


(This article is inspired by the book How to Lie with Statistics by Darrell Huff).

 

Profile
Thulasiram is a veteran with 20 years of experience in production planning, supply chain management, quality assurance, information technology, and training. Trained in data analysis at IIIT Bangalore and UpGrad, he is passionate about education and operations, and about applying data analytics techniques to improve operational efficiency and effectiveness. He currently works as Program Associate for Data Analysis at UpGrad.

Frequently Asked Questions (FAQs)

1. What does misleading mean in statistics?

Statistics can be misused unintentionally or intentionally. A deliberate effort to blur the lines with false information will almost certainly intensify bias, but no malevolent goal is needed to generate confusion. Misuse of statistics is a widespread problem that now affects a wide range of enterprises and academic sectors. Common blunders that lead to misuse include faulty polling, flawed correlation, data fishing, misleading data visualization, purposeful bias, bad sampling, selective data display, omitting the baseline, Simpson’s paradox, and misleading graphs.

2. How does the use of misleading data affect the business?

Today’s successful business organisations rely on data to make well-informed decisions that provide high-value outcomes. Data can aid in the resolution of issues, the monitoring of performance, the improvement of processes, and the acquisition of a better understanding of the market. Poor data quality, on the other hand, can be detrimental to your business. The consequences of acting on misinterpreted data include wrong business strategies, increased financial costs, lost productivity, a damaged reputation, and missed opportunities.

3. What is the main purpose of Data manipulation?

Data manipulation is about sorting, rearranging, and relocating data without altering it. It entails transforming data into the format required for display or for feeding and training an analytics model. Its main goal is to change the relationship between data items (logical or physical), not the data itself. Row and column filtering, aggregation, join and concatenation, string manipulation, categorization, regression, and mathematical formulas are some of the most common operations used to manage data.
