Most discussions of Data Analysis deal with the “science” aspect of it. There is certainly a lot of science behind the process (the algorithms, formulas, and calculations), but you can’t take the “art” away from it. Structuring the complete process, from planning the analysis to making sense of the final result, is no mean feat, and is nothing short of an art form. That is exactly what our topic for the day, Exploratory Data Analysis, is about. In this article, we’ll look at what exploratory data analysis is, what the common tools and techniques for it are, and how it helps an organisation.
What is Exploratory Data Analysis?
Exploratory Data Analysis is one of the important steps in the data analysis process. Here, the focus is on making sense of the data in hand: formulating the right questions to ask of your dataset, working out how to manipulate the data sources to get the required answers, and so on. This is done by taking a close look at trends, patterns, and outliers, usually with visual methods.
Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your data. It provides the context needed to develop an appropriate model – and interpret the results correctly.
Over the years, machine learning has been on the rise, and that has given birth to a number of powerful machine learning algorithms. They are so powerful that they almost tempt you to skip the Exploratory Data Analysis phase. While it’s understandable why you’d want to take advantage of such algorithms and skip the EDA, it is not a good idea to just feed data into a black box and wait for the results. It has been observed time and again that Exploratory Data Analysis surfaces critical information that is very easy to miss, information that helps the analysis in the long run, from framing questions to displaying results. If you are a beginner and interested in learning more about data science, check out our data science training from top universities.
While aspects of EDA have existed for as long as we’ve had data to analyse, Exploratory Data Analysis was formally developed in the 1970s by John Tukey, the same scientist who coined the word “bit” (short for binary digit). EDA is often described as a philosophy more than a science because there are no hard-and-fast rules for approaching it. The purpose of Exploratory Data Analysis is essentially to tackle specific tasks such as:
- Spotting missing and erroneous data;
- Mapping and understanding the underlying structure of your data;
- Identifying the most important variables in your dataset;
- Testing a hypothesis or checking assumptions related to a specific model;
- Establishing a parsimonious model (one that can explain your data using minimum variables);
- Estimating parameters and figuring the margins of error.
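The first task in the list above, spotting missing and erroneous data, can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The record layout, the set of "missing" markers, and the plausible age range are assumptions made for the example, not part of any particular tool.

```python
from collections import Counter

# Values treated as "missing" in this sketch (an assumption; adjust per dataset).
MISSING_MARKERS = {None, "", "NA", "N/A", "null"}


def count_missing(rows):
    """Return {column: number of missing values} for a list of dict records."""
    columns = set()
    for row in rows:
        columns.update(row)
    missing = Counter()
    for row in rows:
        for col in columns:
            # A column absent from a record is also counted as missing.
            if row.get(col) in MISSING_MARKERS:
                missing[col] += 1
    return {col: missing[col] for col in sorted(columns)}


def out_of_range(rows, column, low, high):
    """Return the records whose numeric `column` value falls outside [low, high]."""
    flagged = []
    for row in rows:
        value = row.get(column)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            flagged.append(row)
    return flagged


# Hypothetical records, invented for illustration.
records = [
    {"age": 34, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 212, "city": ""},  # 212 is an implausible age
]
print(count_missing(records))
print(out_of_range(records, "age", 0, 120))
```

In practice a data-frame library would usually do this work, but the idea is the same: count the gaps per field, then check values against what is plausible for that field.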
Tools and Techniques used in Exploratory Data Analysis
S-Plus and R are among the most widely used statistical programming languages for Exploratory Data Analysis, and Python is also a common choice. These languages come with a plethora of tools that help you perform specific statistical functions like:
Classification and dimension reduction techniques
Classification and clustering are used to group data points together based on common parameters or variables. The data we’re talking about is multi-dimensional, and it’s not easy to perform classification or clustering on a multi-dimensional dataset. To help with that, dimensionality reduction techniques like PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are applied. These reduce the number of dimensions in the dataset while retaining as much of the valuable information in your data as possible.
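To make the idea of dimensionality reduction concrete, here is a minimal sketch of the first step of PCA: finding the direction along which the data varies most, then projecting the points onto it. It uses power iteration on the covariance matrix and only the standard library. The function names are illustrative, and a real analysis would normally rely on an established numerical library instead.

```python
import math


def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def first_principal_component(points, iterations=200):
    """Approximate the leading eigenvector of the covariance matrix by power
    iteration. Returns (unit direction, share of total variance it captures).

    Limitation of this sketch: if the start vector happens to be exactly
    orthogonal to the leading direction, power iteration will not find it.
    """
    cov = covariance_matrix(points)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # nothing to amplify (e.g. no variance at all)
            break
        vec = [x / norm for x in nxt]
    # The Rayleigh quotient gives the variance captured along `vec`.
    captured = sum(vec[a] * cov[a][b] * vec[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    return vec, (captured / total if total else 0.0)


def project(points, direction):
    """Coordinates of the mean-centered points along `direction` (1-D reduction)."""
    n, d = len(points), len(direction)
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [sum((p[j] - means[j]) * direction[j] for j in range(d)) for p in points]
```

The returned share of variance is what tells you how much information survives the reduction. A value close to 1 means the single component describes the data almost completely.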
Univariate visualisations show the distribution of each individual field in the raw dataset, together with its summary statistics. They use frequency distribution tables, bar charts, histograms, or pie charts for the graphical representation.
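The numbers behind a univariate view can be computed directly. The sketch below, using only the standard library, produces summary statistics for a numeric field, a frequency table for a categorical field, and the bin counts a histogram would draw. The function names and the equal-width binning rule are choices made for this example.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Summary statistics for one numeric field."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def frequency_table(values):
    """Frequency distribution for one categorical field, most common first."""
    return Counter(values).most_common()


def histogram_counts(values, bins=5):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    step = (high - low) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / step), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts
```

Feeding the output of `histogram_counts` or `frequency_table` to any plotting tool gives the histogram or bar chart described above.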
Bivariate visualisations allow data scientists to assess the relationship between pairs of variables in the dataset, and in particular between each variable and the target variable you’re looking at. The appropriate graph for bivariate analysis depends on the types of the variables in question. If you’re dealing with two continuous variables, a scatter plot should be the graph of your choice. If one is categorical and the other is continuous, a box plot is preferred, and when both variables are categorical, a mosaic plot is chosen.
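The rule of thumb for choosing a bivariate plot can be written down as a small lookup. This is only an encoding of the guidance in the paragraph above. The function name and the two kind labels are hypothetical, chosen for the example.

```python
def suggest_bivariate_plot(kind_x, kind_y):
    """Map a pair of variable kinds ('continuous' or 'categorical') to a plot type."""
    kinds = {kind_x, kind_y}
    if not kinds <= {"continuous", "categorical"}:
        raise ValueError("kinds must be 'continuous' or 'categorical'")
    if kinds == {"continuous"}:
        return "scatter plot"   # two continuous variables
    if kinds == {"categorical"}:
        return "mosaic plot"    # two categorical variables
    return "box plot"           # one of each


print(suggest_bivariate_plot("continuous", "categorical"))
```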
Multivariate visualisations help in understanding the interactions between different fields in the data. They involve the observation and analysis of more than one statistical outcome variable at a time.
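One common starting point for a multivariate view is a correlation matrix, which is what a correlation heatmap displays. The sketch below computes Pearson correlations between every pair of numeric columns using only the standard library. The column-dictionary layout is an assumption for the example, and a constant column would cause a division by zero here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns: {name: list of numbers}. Returns {name: {name: correlation}}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Values near 1 or -1 flag pairs of fields that move together or in opposite directions, which are worth a closer look with the bivariate plots described earlier.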
K-means clustering assigns each data point to the cluster whose “center” (the mean of its members) is nearest. It’s an iterative technique that keeps re-assigning points and re-computing centers until the clusters stop changing between iterations. It can also help with finding outliers in a dataset: points that lie far from every cluster center are good candidates.
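The iteration described above can be sketched in a few lines. This is a minimal version of the standard assign-then-update loop (Lloyd's algorithm) using only the standard library. Seeding the centers with the first k points is a simplification made for the example, and real implementations choose starting centers more carefully.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Assign points to k clusters. Returns (centers, labels).

    Initial centers are simply the first k points (a simplification).
    """
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # clusters stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


def rank_by_distance(points, centers, labels):
    """Return (distance, point) pairs, farthest from their own center first."""
    pairs = [(math.dist(p, centers[lab]), p) for p, lab in zip(points, labels)]
    return sorted(pairs, reverse=True)
```

Points at the top of `rank_by_distance` are the ones worth inspecting as possible outliers. How far is "too far" is a judgment call that depends on the dataset.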
As the name suggests, predictive modeling is a method that uses statistics to predict outcomes. Although most predictions concern what will happen in the future, predictive modeling can be applied to any unknown event, regardless of when it occurs. For example, the technique can be used to detect crimes and identify suspects after the crime has happened. One of the most common ways of performing predictive modeling is linear regression.
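For a single predictor, linear regression reduces to a closed-form least-squares fit. The sketch below is a minimal illustration using only the standard library, with function names chosen for the example. It assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a new x, using the fitted line."""
    return slope * x + intercept
```

A typical use is to fit on observed pairs with `fit_line(xs, ys)` and then call `predict` for values of x that have not been observed yet.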
How does Exploratory Data Analysis help your business and where does it fit in?
Exploratory Data Analysis provides value to any business by helping data scientists check that the results they’ve produced are correctly interpreted and that they apply to the required business contexts. Beyond ensuring technically sound results, Exploratory Data Analysis also benefits stakeholders by confirming whether they are asking the right questions. Exploratory Data Analysis often turns up unexpected insights, ones that the stakeholders or data scientists wouldn’t have thought to investigate, but which can still prove highly informative about the business.
There are a number of data connectors that help organisations incorporate Exploratory Data Analysis directly into their Business Intelligence software. You can also set this up so that data flows the other way, by building and running statistical models in (for example) R that use BI data and automatically update as new information flows into the model.
Potential use-cases of Exploratory Data Analysis are wide-ranging, but ultimately, it all boils down to this – Exploratory Data Analysis is all about getting to know and understand your data before making any assumptions about it, or taking any steps in the direction of Data Mining. It helps you avoid creating inaccurate models or building accurate models on the wrong data.
Performing this step right will give any organisation the necessary confidence in their data – which will eventually allow them to start deploying powerful machine learning algorithms. However, ignoring this crucial step can lead you to build your Business Intelligence System on a very shaky foundation.
Exploratory Data Analysis is quite clearly one of the important steps during the whole process of knowledge extraction. If you want to set up a strong foundation for your overall analysis process, you should focus with all your strength and might on the EDA phase. In all honesty, a bit of statistics is required to ace this step. If you feel you lag behind on that front, don’t forget to read our article on Basics of Statistics Needed for Data Science.
If you’re interested in learning Python and want to get your hands dirty with various tools and libraries, check out the Executive PG Program in Data Science. Oh, and what do you think of our stand of considering “Exploratory Data Analysis” more an art than a science? Let us know in the comments below!
Why should a Data Scientist use Exploratory Data Analysis to improve your business?
The primary goal of Exploratory Data Analysis is to assist in the analysis of data prior to making any assumptions. It can help with the detection of obvious errors, a better comprehension of data patterns, the detection of outliers or unexpected events, and the discovery of interesting correlations between variables.
Data scientists can employ exploratory analysis to ensure that the results they produce are accurate and appropriate for the desired business outcomes and goals. EDA also assists stakeholders by ensuring that they are asking the appropriate questions. Questions about standard deviations, categorical variables, and confidence intervals can all be explored with EDA. Once EDA is complete and the insights have been extracted, its findings can feed into more advanced data analysis or modelling, including machine learning.
What are the most popular use cases for EDA?
It is not uncommon for data scientists to use EDA before trying other types of modelling. It is often used in data analysis to examine datasets and identify outliers, trends, patterns, and errors. For example, EDA is commonly used in retail, where BI tools and experts analyse data to uncover insights into sales trends, top categories, and so on. EDA is also used to identify new trends in a marketplace or industry, and in health care research to determine which strains of flu may be more prevalent in the new flu season or to verify the homogeneity of a patient population.
What are the types of Exploratory Data Analysis?
The types of Exploratory Data Analysis are:
1. Univariate non-graphical: The standard purpose of univariate non-graphical EDA is to understand the sample distribution of a single variable and make observations about the population.
2. Univariate graphical: These techniques show the distribution of a single variable using histograms, stem-and-leaf plots, box plots, etc.
3. Multivariate non-graphical: These techniques use cross-tabulation or statistics to depict the relationship between two or more variables.
4. Multivariate graphical: These techniques use graphical representations to display the relationships between two or more variables.
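Cross-tabulation, the multivariate non-graphical technique mentioned in the list above, is simple enough to sketch directly. The example below counts how often each pair of category values occurs together, using only the standard library. The record layout and field names are hypothetical.

```python
from collections import Counter


def crosstab(rows, row_key, col_key):
    """Count co-occurrences of two categorical fields in a list of dict records.

    Returns {row_value: {col_value: count}} with zero counts filled in.
    """
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_values = sorted({rv for rv, _ in counts})
    col_values = sorted({cv for _, cv in counts})
    return {rv: {cv: counts[(rv, cv)] for cv in col_values} for rv in row_values}


# Hypothetical sales records, invented for illustration.
sales = [
    {"region": "North", "category": "Toys"},
    {"region": "North", "category": "Books"},
    {"region": "South", "category": "Toys"},
    {"region": "North", "category": "Toys"},
]
print(crosstab(sales, "region", "category"))
```

The resulting table is also the input a mosaic plot would visualise, which links the non-graphical and graphical multivariate techniques.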