Data science is all about experimenting with raw or structured data. Data is the fuel that can steer a business in the right direction, or at least provide actionable insights that help it strategize current campaigns, organize the launch of new products, or try out different experiments.
All of these activities have one common driving component: data. We are entering a digital era in which we produce a lot of data. For instance, a company like Flipkart produces more than 2TB of data on a daily basis.
When data matters this much, it becomes important to store and process it properly and without error. When dealing with datasets, the category of the data largely determines which preprocessing strategy will work for a particular set and which type of statistical analysis should be applied to it. Let’s dive into some of the commonly used categories of data.
Qualitative Data Type
Qualitative, or categorical, data describes the object under consideration using a finite set of discrete classes. This type of data can’t easily be counted or measured with numbers, so it is divided into categories instead. The gender of a person (male, female, or other) is a good example of this data type.
Such data is usually extracted from audio, images, or text. Another example is a smartphone listing, which provides information such as the current rating, the color of the phone, the category of the phone, and so on. All of this information can be categorized as qualitative data. There are two subcategories: nominal and ordinal.
Nominal data is a set of values that has no natural ordering. Let’s understand this with some examples. The color of a smartphone can be considered a nominal data type because we can’t rank one color against another.
It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is another example, since male, female, and other cannot be ranked against each other. Mobile phone categories such as midrange, budget, or premium are also treated as a nominal data type when they are used purely as labels.
Ordinal data consists of values that have a natural ordering while still belonging to distinct classes. If we consider clothing sizes, we can easily sort them by their tags in the order small < medium < large. The grading system used to mark candidates in a test can also be considered an ordinal data type, where an A+ is definitely better than a B grade.
These categories help us decide which encoding strategy can be applied to which type of data. Encoding qualitative data is important because machine learning models are mathematical in nature: they can’t handle these values directly, so the values need to be converted to numerical types.
For the nominal data type, where the categories cannot be compared, one-hot encoding can be applied. It is similar to binary coding and works best when there are only a few categories. For the ordinal data type, label encoding can be applied, which is a form of integer encoding.
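As a rough illustration of the two strategies, here is a minimal sketch in plain Python. The sample values and the helper names `one_hot_encode` and `ordinal_encode` are made up for this example; in practice a library such as pandas or scikit-learn would normally do this work.

```python
# Hypothetical sample values; the categories are borrowed from the examples above.
colors = ["Red", "Blue", "Black", "Blue"]       # nominal: no natural order
sizes = ["medium", "small", "large", "small"]   # ordinal: small < medium < large


def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    # The column order carries no meaning; it is sorted only for repeatability.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


def ordinal_encode(values, ordered_categories):
    """Map each value to its integer rank in the order supplied by the caller."""
    rank = {category: index for index, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


color_columns, color_rows = one_hot_encode(colors)
size_codes = ordinal_encode(sizes, ["small", "medium", "large"])

print(color_columns)  # ['Black', 'Blue', 'Red']
print(color_rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(size_codes)     # [1, 0, 2, 0]
```

Note that the integer codes for the sizes depend on the order passed in by the caller. An encoder that assigns integers alphabetically would not preserve small < medium < large.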
Quantitative Data Type
This data type quantifies things using numerical values that can be counted or measured. The price of a smartphone, the discount offered, the number of ratings on a product, the frequency of a smartphone’s processor, and the RAM of that particular phone all fall under the category of quantitative data types.
The key thing is that a feature can take an infinite number of values. For instance, the price of a smartphone can range from some amount x to any value, and it can be broken down further into fractional values. The two subcategories are discrete and continuous.
Discrete data consists of numerical values that are integers or whole numbers. The number of speakers in the phone, the number of cameras, the number of cores in the processor, and the number of SIMs supported are all examples of the discrete data type.
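Fractional numbers are considered continuous values. Examples include the operating frequency of the processor, the Wi-Fi frequency, the temperature of the cores, and so on. The Android version of the phone is sometimes listed here as well because it is written with a decimal point, although a version number is a label rather than a measurement.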
Can Ordinal and Discrete type overlap?
If you look closely, you can assign numbers to ordinal classes, so should the result be called discrete or ordinal? The truth is that it is still ordinal. The reason is that even after numbering, the numbers don’t convey the actual distances between the classes.
For instance, consider the grading system of a test. The grades can be A, B, C, D, and E, and if we number them from the start they become 1, 2, 3, 4, and 5. According to the numerical differences, the distance between an E grade and a D grade is the same as the distance between a D grade and a C grade. That is not very accurate: we all know that a C grade is still acceptable compared to an E grade, yet the numbering declares the two gaps equal.
The same reasoning applies to a survey form where user experience is recorded on a scale from very poor to very good. The differences between the classes are not clear and therefore can’t be quantified directly.
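The point about distances can be made concrete with a short sketch. The numbering of the grades follows the example above, but the minimum marks per grade are invented purely for illustration and do not come from any real grading scheme.

```python
# Grades numbered 1..5 starting from A, as in the example above.
grade_codes = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

# HYPOTHETICAL lowest mark needed for each grade (assumed for illustration only).
assumed_min_marks = {"A": 90, "B": 75, "C": 60, "D": 35, "E": 0}


def code_gap(first, second):
    """Distance between two grades according to their integer codes."""
    return abs(grade_codes[first] - grade_codes[second])


def mark_gap(first, second):
    """Distance between two grades according to the assumed minimum marks."""
    return abs(assumed_min_marks[first] - assumed_min_marks[second])


# The integer codes say both gaps are identical...
print(code_gap("C", "D"), code_gap("D", "E"))  # 1 1
# ...while the assumed marks behind the grades say they are not.
print(mark_gap("C", "D"), mark_gap("D", "E"))  # 25 35

# The codes are still good for what ordinal data supports: ranking.
ranked = sorted(["C", "A", "E"], key=grade_codes.get)
print(ranked)  # ['A', 'C', 'E']
```

The codes are safe for sorting and ranking, but arithmetic on them, such as averaging, silently assumes equal spacing between the classes.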
We have discussed all the major classifications of data. This is important because we can now choose which tests to perform on which category. It now makes sense to plot a histogram or frequency plot for quantitative data and a pie chart or bar plot for qualitative data.
Regression analysis, which examines the relationship between one dependent variable and one or more independent variables, needs a quantitative dependent variable in its classical linear form, and any qualitative predictors have to be encoded first. The ANOVA (analysis of variance) test compares the mean of a quantitative measurement variable across groups defined by qualitative variables: a one-way ANOVA uses one nominal variable, while a two-way ANOVA uses one measurement variable and two nominal variables.
Similarly, you can apply the Chi-square test to qualitative data to discover relationships between categorical variables.
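To show what the Chi-square test actually computes, here is a minimal sketch of the Pearson chi-square statistic for a contingency table of two categorical variables. The counts are hypothetical and the function name `chi_square_statistic` is chosen for this example; a statistics library would normally also report the p-value.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# HYPOTHETICAL counts: rows are phone categories (budget, premium),
# columns are phone colors (Black, Red).
observed_counts = [[10, 20],
                   [20, 10]]

statistic, dof = chi_square_statistic(observed_counts)
print(round(statistic, 4), dof)  # 6.6667 1
```

With one degree of freedom, the 5% critical value of the chi-square distribution is about 3.84, so counts like these would suggest that the two variables are related. This sketch applies no continuity correction, which some library implementations use by default for 2x2 tables.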
In this article, we discussed how central the data we produce has become and how the various categories of data are organized. We also looked at how ordinal data types can appear to overlap with discrete data types.
We also covered which type of plot suits which category of data, along with the tests that apply to a specific data type and the tests that use more than one type.