Most of the discussions on Data Analysis deal with the “science” aspect of it. Surely, there’s a lot of science behind the whole process – the algorithms, formulas, and calculations, but you can’t take the “art” away from it. Structuring the complete process – from planning the analysis, to making sense of the final result – is no mean feat, and is no less than an art form. That is exactly what comes under our topic for the day – Exploratory Data Analysis. In this article, we’ll be looking at what is exploratory data analysis, what are the common tools and techniques for it, and how does it help an organisation.
What is Exploratory Data Analysis?
Exploratory Data Analysis is one of the important steps in the data analysis process. Here, the focus is on making sense of the data in hand – things like formulating the correct questions to ask to your dataset, how to manipulate the data sources to get the required answers, and others. This is done by taking an elaborate look at trends, patterns, and outliers using a visual method.
Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your data. It provides the context needed to develop an appropriate model – and interpret the results correctly.
Over the years, machine learning has been on the rise – and that’s given birth to a number of powerful machine learning algorithms. So powerful that they almost tempt you to skip the Exploratory Data Analysis phase. While it’s understandable why you’d want to take advantage of such algorithms and skip the EDA – It is not a very good idea to just feed data into a black box and wait for the results. It has been observed time and time again that Exploratory Data Analysis provides a lot of critical information which is very easy to miss – information that helps the analysis in the long run, from framing questions to displaying results.
While the aspects of EDA have existed as long as we’ve had data to analyse, Exploratory Data Analysis officially was developed back in the 1970s by John Turkey – the same scientist who coined the word “Bit” (short for Binary Digit). EDA is often seen and described as a philosophy more than science because there are no hard-and-fast rules for approaching it. The purpose of Exploratory Data Analysis is essential to tackle specific tasks such as:
- Spotting missing and erroneous data;
- Mapping and understanding the underlying structure of your data;
- Identifying the most important variables in your dataset;
- Testing a hypothesis or checking assumptions related to a specific model;
- Establishing a parsimonious model (one that can explain your data using minimum variables);
- Estimating parameters and figuring the margins of error.
Tools and Techniques used in Exploratory Data Analysis
S-Plus and R are the most important statistical programming languages used to perform Exploratory Data Analysis. These languages come bundled with a plethora of tools that help you perform specific statistical functions like:
Classification and dimension reduction techniques
Classification is essentially used to group together different datasets based on a common parameter/variable. The data we’re talking about is multi-dimensional, and it’s not easy to perform classification or clustering on a multi-dimensional dataset. Hence, to help with that, Dimensionality Reduction techniques like PCA and LDA are performed – these reduce the dimensionality of the dataset without losing out on any valuable information from your data.
Univariate visualisations are essentially probability distributions of each and every field in the raw dataset – with summary statistics. Univariate visualisations use frequency distribution tables, bar charts, histograms, or pie charts for the graphical representation.
These allow the data scientists to assess the relationship between variables in your dataset – and helps you target the variable you’re looking at. Appropriate graphs for Bivariate Analysis depend on the type of variable in question. For instance, if you’re dealing with two continuous variables, a scatter plot should be the graph of your choice. If one is categorical and the other is continuous, a box plot is preferred and when both the variables are categorical, a mosaic plot is chosen.
Multivariate visualizations help in understanding the interactions between different data-fields. It involves observation and analysis of more than one statistical outcome variable at any given time.
K-means clustering is basically used to create “centers” for each cluster based on the nearest mean. It’s an iterative technique that keeps creating and re-creating clusters – until the clusters formed stop changing with iterations. It can be used for finding outliers in a dataset (points that won’t be a form of any clusters will ideally be outliers).
As the name suggests, predictive modeling is a method that uses statistics to predict outcomes. Although most predictions aim to predict what’ll happen in the future, predictive modeling can also be applied to any unknown event, regardless of when it’s likely to occur. For example, this technique can be used to detect crime and identify suspects even after the crime has happened. The most common way of performing predictive modeling is using linear regression (see the image).
How does Exploratory Data Analysis help your business and where does it fit in?
Exploratory Data Analysis provides utmost value to any business by helping scientists understand if the results they’ve produced are correctly interpreted and if they apply to the required business contexts. Other than just ensuring technically sound results, Exploratory Data Analysis also benefits stakeholders by confirming if the questions they’re asking are right or not. Exploratory Data Science often turns up with unpredictable insights – ones that the stakeholders or data scientists wouldn’t even care to investigate in general, but which can still prove to be highly informative about the business.
There are a number of data connectors that help organisations incorporate Exploratory Data Analysis directly into their Business Intelligence software. You can also set this up to allow data to flow the other way too, by building and running statistical models in (for example) R that use BI data and automatically update as new information flows into the model.
Potential use-cases of Exploratory Data Analysis are wide-ranging, but ultimately, it all boils down to this – Exploratory Data Analysis is all about getting to know and understand your data before making any assumptions about it, or taking any steps in the direction of Data Mining. It helps you avoid creating inaccurate models or building accurate models on the wrong data.
Performing this step right will give any organisation the necessary confidence in their data – which will eventually allow them to start deploying powerful machine learning algorithms. However, ignoring this crucial step can lead you to build your Business Intelligence System on a very shaky foundation.
Exploratory Data Analysis is quite clearly one of the important steps during the whole process of knowledge extraction. If you want to set up a strong foundation for your overall analysis process, you should focus with all your strength and might on the EDA phase. In all honesty, a bit of statistics is required to ace this step. If you feel you lag behind on that front, don’t forget to read our article on Basics of Statistics Needed for Data Science.
Oh, and what do you feel about our stand of considering “Exploratory Data Analysis” as an art more than science? Let us know in the comments below!