A Google search for ‘average data scientist salary in India’ will return a happy result.
Does this mean that any person who wants to enter this exotic field can expect this salary? Why not? What’s wrong with expecting to earn a sum claimed by a reputed website? After all, this website may have conducted some extensive research to arrive at this number. Yet, taking a decision based on this claim alone is not a good idea. But why? Read on!
What does “average” mean in the above Google search? Averages come in different flavours. These are mean, median, and mode. Which average does this “national average” refer to? If it is the mean, what can you infer from it? Check a result from another website.
Here it says, “Experience strongly influences income for this job”.
Why is this important?
A person with rich experience may be drawing a better income than someone without any. An individual who graduated from a reputed institute could be earning more than someone who is self-taught. There is a fair chance that a person could inflate his/her salary in a survey to boost his/her status. Or, a person could downplay his/her salary for other reasons such as taxes. When salaries are spread this unevenly, the mean alone is not an appropriate summary.
If you calculate the mean of such salaries, a few outliers will have an undue effect on the average obtained. A handful of very high salaries will pull the mean up. In such cases, the median is a better representative: by definition, half the people earn less than the median and half earn more.
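To make the effect of outliers concrete, here is a minimal sketch in Python. The salary figures are invented purely for illustration; they are not taken from any survey.

```python
from statistics import mean, median

# Hypothetical annual salaries (say, in lakh rupees).
# The last two values are deliberate outliers.
salaries = [4, 5, 5, 6, 6, 7, 8, 9, 60, 90]

mean_salary = mean(salaries)      # pulled up by the two large values
median_salary = median(salaries)  # midpoint of the sorted list

print(f"mean:   {mean_salary}")
print(f"median: {median_salary}")

# Count how many people earn less than each "average".
below_mean = sum(s < mean_salary for s in salaries)
below_median = sum(s < median_salary for s in salaries)
print(f"{below_mean} of {len(salaries)} earn less than the mean")
print(f"{below_median} of {len(salaries)} earn less than the median")
```

With these made-up numbers, eight of the ten people earn less than the mean, while the median splits the group evenly. A reader who expects the "average" salary would be disappointed most of the time if that average is the mean.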
In the future, if you come across the word ‘average’ anywhere, look for amplifying information. Check if the author is referring to the mean, median, or mode. Check for confidence intervals and levels of significance. If these are not found, then there is reason enough to be skeptical.
Say a claim does specify the type of average. Can you then take it at face value? No? Why not?
Let’s go back to the original statement about the average salary of data scientists. The statement claims to be from a sample of 303 salaries. Exactly one day ago, this number was 12. Is this a sample you can trust?
To conduct a survey or an experiment, the sample needs to be a true representative of the underlying population. The size of the sample must be large enough to confidently draw inferences about the population.
I was watching some lectures by Professor Starbird about statistics. I learned that years ago, a newspaper conducted a survey regarding the presidential elections in the US. This newspaper sent out a questionnaire, analysed the responses, and published the result that a particular candidate was going to win. After the election, the result was the opposite of what the paper predicted. The candidate predicted by the newspaper lost by a wide margin. Subsequently, the newspaper analysed where it went wrong.
The paper’s management found that it only sent the questionnaire to its affluent subscribers. Evidently, they did not represent the whole population. As a consequence, the prediction based on this biased sample became a source of embarrassment for the newspaper.
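The newspaper's mistake can be imitated with a small simulation. This is only a sketch under invented assumptions: the share of affluent voters (20%) and the support levels for a hypothetical candidate A (70% among affluent voters, 35% among everyone else) are made up for illustration and do not describe the real election.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical electorate: 20% affluent, 80% not.
# Assumed support for candidate A: 70% among affluent voters, 35% among the rest.
def make_voter():
    affluent = random.random() < 0.20
    p_support_a = 0.70 if affluent else 0.35
    return {"affluent": affluent, "votes_a": random.random() < p_support_a}

population = [make_voter() for _ in range(100_000)]

def share_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(v["votes_a"] for v in voters) / len(voters)

# Biased sample: only affluent voters receive the questionnaire.
affluent_only = [v for v in population if v["affluent"]]
biased_sample = random.sample(affluent_only, 2_000)

# Representative sample: drawn at random from the whole population.
random_sample = random.sample(population, 2_000)

true_share = share_for_a(population)
biased_share = share_for_a(biased_sample)
random_share = share_for_a(random_sample)

print(f"true support for A:       {true_share:.1%}")
print(f"biased sample estimate:   {biased_share:.1%}")
print(f"random sample estimate:   {random_share:.1%}")
```

Both samples have the same size. The biased one predicts a comfortable win for A even though A is behind in the full population, which shows that a large sample does not help if it is drawn from the wrong people.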
You can infer whatever results you’d like to see by taking a very small sample! As a very basic example, if you toss a coin 10 times, do you get heads five times and tails five times? You could get seven heads in a row, and maybe this is the result you desire. The ‘law of averages’ will only work (i.e. half heads, half tails) when this coin-tossing experiment is performed a large number of times. In the short term, any result is possible.
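The coin-tossing experiment is easy to try out. The sketch below uses Python's pseudo-random generator as a stand-in for a fair coin; the function name `heads_fraction` and the chosen numbers of tosses are just illustrative.

```python
import random
from math import comb

random.seed(1)  # fixed seed so the run is repeatable

def heads_fraction(n_tosses):
    """Toss a simulated fair coin n_tosses times; return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The fraction of heads wanders for small runs and settles near 0.5 for large ones.
for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} tosses: {heads_fraction(n):.4f} heads")

# Exact probability of getting exactly 5 heads in 10 tosses of a fair coin.
p_exactly_five = comb(10, 5) / 2**10
print(f"P(exactly 5 heads in 10 tosses) = {p_exactly_five:.3f}")
```

Even for a perfectly fair coin, an exact five-five split in ten tosses happens only about a quarter of the time (252 out of 1,024 equally likely sequences). Someone who wants a lopsided result only has to keep the sample small and repeat the experiment until one turns up.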
If you don’t see information about the sample size along with the type of average, this is a cause for concern. If the sample size is sufficient and is a true representative of the population, then there is no need to hide it.
A report claimed that in a particular college, 33% of the female students married professors.
We need to be very careful with percentages. If percentages are not accompanied by the actual numbers, they may be misleading. In the college mentioned above, it turned out that only three women studied there, and just one of them married a professor. One out of three makes 33%. Always check if percentages are accompanied by the actual numbers. If they are not, then there is cause for concern.
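One simple habit is to never report a percentage without the counts behind it. The helper below is a hypothetical illustration of that habit; the name `describe_share` and the second set of numbers are made up for comparison.

```python
def describe_share(count, total):
    """Report a percentage together with the raw numbers behind it."""
    percent = 100 * count / total
    return f"{percent:.0f}% ({count} of {total})"

# The same headline percentage, with very different evidence behind it.
print(describe_share(1, 3))       # 33% (1 of 3)
print(describe_share(333, 1000))  # 33% (333 of 1000)
```

Both lines read "33%", but only the counts reveal that the first figure rests on a single person.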
Another major fallacy in statistics is confusing correlation with causation. If two items are correlated, it does not follow that one causes the other.
In a group of aboriginal people, the presence of lice on the body was considered safe. If a person in that tribe ran a fever, it was observed that there were no lice on his/her body. So, the tribe naively assumed that this lack of lice was, in fact, the cause of the fever. Later it was found that when a person suffered from fever, the increased body temperature became uncomfortable for the lice. The fever was causing the lice to abandon their host; their absence was not the cause of the fever, as assumed.
Say, ‘A’ and ‘B’ are correlated. There could be some other variable ‘C’ that causes ‘A’ and ‘B’ to rise and fall together. ‘A’ could be the cause, and ‘B’ could be the effect, or it could be the other way around or just a coincidence. The point is, there is no way to tell without carrying out controlled experiments. Correlation should never be confused with causation.
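The hidden-variable case can be sketched with a short simulation. Everything here is an assumption chosen for illustration: 'C' is random noise, and 'A' and 'B' are each built as 'C' plus their own independent noise, so neither has any influence on the other.

```python
import random
from math import sqrt

random.seed(2)  # fixed seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

n = 5_000
# C is the hidden common cause.
c = [random.gauss(0, 1) for _ in range(n)]
# A and B each depend on C plus their own noise; A never touches B.
a = [ci + random.gauss(0, 0.5) for ci in c]
b = [ci + random.gauss(0, 0.5) for ci in c]

r_ab = pearson(a, b)

# Remove C's contribution; what remains is independent noise.
a_resid = [ai - ci for ai, ci in zip(a, c)]
b_resid = [bi - ci for bi, ci in zip(b, c)]
r_resid = pearson(a_resid, b_resid)

print(f"correlation of A and B:                  {r_ab:.2f}")
print(f"correlation after removing C's effect:   {r_resid:.2f}")
```

'A' and 'B' come out strongly correlated even though the code never lets one affect the other, and the correlation all but vanishes once the contribution of 'C' is subtracted. In real data 'C' is usually unobserved, which is why a correlation alone cannot settle the question.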
Graphs, too, can be manipulated to look impressive without misquoting the data.
These are only a few of the ways in which statistics can be used to lie. This list is only suggestive, not exhaustive. All these methods of bluffing go to show that statistics is as much an art as it is a science.
Data is the new oil. Most decisions in the private and public sectors are based on data and its analysis. Wrong interpretations of data or derivations of incorrect insights will have costly ramifications.
In the world of viral marketing, you need to be extra careful about the claims of advertisers. Here too, you need to be aware of the existence of statistics as an art. A little skepticism about the claims of advertisers, combined with a knowledge of how people deploy statistics to tell lies, will inevitably help you make better and more conscious decisions.
(This article is inspired by the book How to Lie with Statistics by Darrell Huff).
What does misleading mean in statistics?
Misuse of statistics can be unintentional or intentional. A deliberate effort to blur the lines with false information will certainly introduce bias, but it does not take a malevolent goal to generate confusion. Misuse of statistics is a widespread problem that now affects a wide range of enterprises and academic sectors. Common blunders that lead to misuse include faulty polling, flawed correlation, data fishing, misleading data visualization, purposeful bias, bad sampling, selective data display, omitting the baseline, Simpson’s paradox, and misleading graphs.
How does the use of misleading data affect the business?
Today’s successful business organisations rely on data to make well-informed decisions that provide high-value outcomes. Data can aid in the resolution of issues, the monitoring of performance, the improvement of processes, and the acquisition of a better understanding of the market. Poor data quality, on the other hand, can be detrimental to your business. The consequences of using misinterpreted data are wrong business strategies, increased financial costs, loss in productivity, damaged reputation, and missed opportunities.
What is the main purpose of Data manipulation?
Sorting, rearranging, and relocating data without affecting it is what data manipulation is all about. It entails transforming data into the format required for displaying data or feeding and training an analytics model. Data manipulation’s main goal is to change the relationship between two data items (logical or physical), not the data itself. Row and column filtering, aggregation, join and concatenation, string manipulation, categorization, regression, and mathematical formulas are some of the most common processes used to manage data.