A Google search for ‘average data scientist salary in India’ returned the following result:
Does this mean that any person who wants to enter this exotic field can expect this salary? Why not? What’s wrong with expecting to earn a sum claimed by a reputed website? After all, this website may have conducted some extensive research to arrive at this number. Yet, taking a decision based on this claim alone is not a good idea. But why? Read on!
What does “average” mean in the above Google search? Averages come in different flavours. These are mean, median, and mode. Which average does this “national average” refer to? If it is the mean, what can you infer from it? Check a result from another website.
Here it says, “Experience strongly influences income for this job”.
Why is this important?
A person with extensive experience may be drawing a higher income than someone with none. An individual who graduated from a reputed institute could be earning more than someone who is self-taught. There is also a fair chance that a respondent inflates their salary in a survey to boost their status, or understates it for other reasons such as taxes. In such scenarios, the mean on its own is not an appropriate summary.
If you calculate the mean of such salaries, a few outliers will have an undue effect on the average obtained. A handful of very high salaries will pull the mean up. In such cases, the median is the better representative: half of the people earn less than it and half earn more.
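To see how strongly a few outliers can pull the mean, here is a minimal sketch in Python. The salary figures are invented purely for illustration and are not taken from any survey or website.

```python
from statistics import mean, median

# Hypothetical annual salaries in lakh rupees (invented for illustration):
# seven typical earners plus two very high earners.
salaries = [4, 5, 5, 6, 6, 7, 8, 60, 90]

salary_mean = mean(salaries)
salary_median = median(salaries)

# Count how many people in the sample earn less than the mean.
below_mean = sum(1 for s in salaries if s < salary_mean)

print(f"mean   = {salary_mean:.1f}")
print(f"median = {salary_median}")
print(f"{below_mean} of {len(salaries)} people earn less than the mean")
```

In this made-up sample, most people earn well below the mean, while the median sits in the middle of the typical earners.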
In the future, if you come across the word ‘average’ anywhere, look for amplifying information. Check if the author is referring to the mean, median, or mode. Check for confidence intervals and levels of significance. If these are not found, then there is reason enough to be skeptical.
Say a claim does specify the type of average. Can you then take it at face value? No? Why not?
Let’s go back to the original statement about the average salary of data scientists. The statement claims to be from a sample of 303 salaries. Exactly one day ago, this number was 12. Is this a sample you can trust?
For a survey or an experiment to be meaningful, the sample needs to be representative of the underlying population. The sample must also be large enough to draw inferences about the population with confidence.
I was watching some lectures on statistics by Professor Starbird. I learned that years ago, a newspaper conducted a survey ahead of a presidential election in the US. This newspaper sent out a questionnaire, analysed the responses, and published the result that a particular candidate was going to win. The election went the opposite way: the candidate the newspaper had predicted to win lost by a wide margin. Subsequently, the newspaper analysed where it went wrong.
The paper’s management found that it only sent the questionnaire to its affluent subscribers. Evidently, they did not represent the whole population. As a consequence, the prediction based on this biased sample became a source of embarrassment for the newspaper.
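The effect of such a biased sample can be sketched with a toy population. All the numbers below are hypothetical and only mirror the shape of the story: an affluent minority that favours candidate A, and a larger remaining group that favours candidate B.

```python
def share_for_a(votes):
    """Return the fraction of votes that go to candidate A."""
    return votes.count("A") / len(votes)

# Hypothetical electorate (invented numbers, not historical data).
affluent = ["A"] * 210 + ["B"] * 90    # 300 affluent subscribers
others = ["A"] * 245 + ["B"] * 455     # 700 other voters
population = affluent + others

print(f"Affluent-only survey: {share_for_a(affluent):.1%} for A")
print(f"Whole population:     {share_for_a(population):.1%} for A")
```

Surveying only the affluent group suggests a comfortable win for A, while the population as a whole prefers B.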
With a very small sample, you can get almost any result you’d like to see! As a very basic example, if you toss a fair coin 10 times, do you always get heads five times and tails five times? No. You could even get seven heads in a row, and maybe this is the result you desire. The ‘law of averages’ (the share of heads settling near one half) only holds when the coin-tossing experiment is repeated a large number of times. In the short term, any result is possible.
If you don’t see information about the sample size along with the type of average, this is a cause for concern. If the sample is large enough and representative of the population, there is no reason to hide its size.
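The coin-tossing point can be checked with a short simulation. This is a sketch, assuming a fair coin modelled with Python's pseudo-random generator; the helper name `heads_share` and the seed are arbitrary choices. It also computes the exact chance of getting exactly five heads in ten tosses, which is only about 25%.

```python
import math
import random

def heads_share(tosses, rng):
    """Return the fraction of heads in `tosses` simulated fair coin flips."""
    heads = sum(1 for _ in range(tosses) if rng.random() < 0.5)
    return heads / tosses

# Exact probability of exactly 5 heads in 10 fair tosses: C(10, 5) / 2^10.
p_five_of_ten = math.comb(10, 5) / 2 ** 10

rng = random.Random(42)  # fixed, arbitrary seed so the run is repeatable
print(f"P(exactly 5 heads in 10 tosses) = {p_five_of_ten:.3f}")
for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} tosses: share of heads = {heads_share(n, rng):.3f}")
```

Small runs can land far from one half, while the share for large runs tends to stay close to it.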
A report claimed that in a particular college 33% of the female students married their professors.
We need to be very careful with percentages. If percentages are not accompanied by the actual numbers, they may be misleading. In the college mentioned above, it turned out that only three women studied there, and just one of them married a professor. One out of three makes 33%. Always check if percentages are accompanied by the actual numbers. If they are not, then there is a cause for concern.
Another major fallacy in statistics is confusing correlation with causation. If two things are correlated, it does not follow that one causes the other.
In a group of aboriginal people, the presence of lice on the body was considered a sign of good health. When a person in that tribe ran a fever, it was observed that there were no lice on their body. So, the tribe naively assumed that this lack of lice was, in fact, the cause of the fever. Later it was found that when a person suffered from a fever, the increased body temperature became uncomfortable for the lice. The fever was causing the lice to abandon their host; their absence was not the cause of the fever, as assumed.
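One way to see why the raw counts matter is to attach a rough uncertainty range to the percentage. The sketch below uses the Wilson score interval, a standard approximation for proportions. It is not something the report or the book uses; it is brought in here only as an illustration, and the 1000-out-of-3000 case is a hypothetical comparison.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same percentage, very different amounts of evidence behind it.
for successes, n in [(1, 3), (1000, 3000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n} = {successes / n:.0%}, "
          f"plausible range {low:.0%} to {high:.0%}")
```

Both cases read as 33%, but with only three students the plausible range covers most of the scale, so the percentage says very little.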
Say, ‘A’ and ‘B’ are correlated. There could be some other variable ‘C’ that causes ‘A’ and ‘B’ to rise and fall together. ‘A’ could be the cause, and ‘B’ could be the effect, or it could be the other way around or just a coincidence. The point is, there is no way to tell without carrying out controlled experiments. Correlation should never be confused with causation.
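The hidden-variable case can be illustrated with simulated data. In this sketch, all values are synthetic: ‘C’ is drawn at random, and ‘A’ and ‘B’ are each built as ‘C’ plus independent noise, so neither causes the other. The `pearson` helper is written out by hand to keep the example self-contained.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)  # fixed, arbitrary seed so the run is repeatable
c = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
a = [ci + rng.gauss(0, 0.5) for ci in c]     # A depends only on C
b = [ci + rng.gauss(0, 0.5) for ci in c]     # B depends only on C

r_ab = pearson(a, b)

# Subtract C's contribution and see what correlation is left.
a_rest = [ai - ci for ai, ci in zip(a, c)]
b_rest = [bi - ci for bi, ci in zip(b, c)]
r_rest = pearson(a_rest, b_rest)

print(f"correlation of A and B:            {r_ab:.2f}")
print(f"after removing the influence of C: {r_rest:.2f}")
```

‘A’ and ‘B’ move together strongly, yet once the contribution of ‘C’ is taken out, almost nothing of the relationship remains.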
Similarly, graphs can be manipulated to look impressive without misquoting the data.
These are only a few of the ways in which statistics can be used to lie. This list is only suggestive, not exhaustive. All these methods of bluffing go to show that statistics is as much an art as it is a science.
Data is the new oil. Most decisions in the private and public sectors are based on data and its analysis. Wrong interpretations of data or derivations of incorrect insights will have costly ramifications.
In the world of viral marketing, you need to be extra careful about the claims of advertisers. Here too, you need to be aware of the existence of statistics as an art. A little skepticism about the claims of advertisers, combined with a knowledge of how people deploy statistics to tell lies, will inevitably help you make better and more conscious decisions.
(This article is inspired by the book How to Lie with Statistics by Darrell Huff).
By Thulasiram Gunipati