Many professionals and ‘Data’ enthusiasts often ask, “What’s the difference between Data Science, Machine Learning and Big Data?” This is a question frequently asked nowadays.
Here’s what differentiates Data Science, Machine Learning and Big Data from each other:
Data Science follows an interdisciplinary approach. It lies at the intersection of Maths, Statistics, Artificial Intelligence, Software Engineering and Design Thinking. Data Science deals with data collection, cleaning, analysis, visualisation, model creation, model validation, prediction, designing experiments, hypothesis testing and much more. The aim of all these steps is just to derive insights from data.
Digitisation is progressing at an exponential rate. Internet accessibility is improving at breakneck speed. More and more people are getting absorbed into the digital ecosystem. All these activities are generating a humongous amount of data. Companies are currently sitting on a data landmine. But data, by itself, is not of much use. This is where Data Science comes into the picture. It helps in mining this data and deriving insights from it; for taking meaningful action. Various Data Science tools can help us in the process of insight generation.
Frameworks exist to help derive insights from data. A framework is nothing but a supportive structure. It’s a lifecycle used to structure the development of Data Science projects. A lifecycle outlines the steps — from start to finish — that projects usually follow. In other words, it breaks down the complex challenges into simple steps.
This ensures that any significant phase, which leads to the generation of actionable insights from data, is not missed out.
One such framework is the ‘Cross Industry Standard Process for Data Mining’, abbreviated as the CRISP-DM framework. The other is the ‘Team Data Science Process’ (TDSP) from Microsoft.
Let’s understand this with the help of an example. A bank named ‘X’, which has been in business for the past ten years. It receives a loan application from one of its customers. Now, it wants to predict whether this customer will default in repaying the loan. How can the bank go about achieving this task?
Like every other bank, X must have captured data regarding various aspects of their customers, such as demographic data, customer-related data, etc. In the past ten years, many customers would have succeeded in repaying the loan, but some customers would have defaulted. How can this bank leverage this data to improve its profitability? To put it simply, how can it avoid providing loans to a customer who is very likely to default? How can they ensure not losing out on good customers who are more likely to repay their debts? Data Science can help us resolve this challenge.
Raw Data —> Data Science —-> Actionable Insights
Let’s understand how various branches of Data Science will help the bank overcome its challenge. Statistics will assist in the designing of experiments, finding a correlation between variables, hypothesis testing, exploratory data analysis, etc. In this case, the loan purpose or educational qualifications of the customer could influence their loan default. After performing data cleaning and exploratory study, the data becomes ready for modeling.
Statistics and artificial intelligence provide algorithms for model creation. Model creation is where machine learning comes into the picture. Machine learning is a branch of artificial intelligence that is utilised by data science to achieve its objectives. Before proceeding with the banking example, let’s understand what machine learning is.
“Machine learning is a form of artificial intelligence. It gives machines the ability to learn, without being explicitly programmed.”
How can machines learn without being explicitly programmed, you might ask? Aren’t computers just devices made to follow instructions? Not anymore.
Machine learning consists of a suite of intelligent algorithms, enabling machines to learn without being explicitly programmed for it. Machine learning helps you learn the objective function — which maps the inputs to the target variable, or independent variables to the dependent variables.
In our banking example, the objective function determines the various demographics, customer and behavioural variables which influences the probability of a loan default. Independent attributes or inputs are the demographic, customer and behavioural variables of a customer. The dependent variable is either ‘to default’ or not. The objective function is an equation which maps these inputs to outputs. It’s a function which tells us which independent variables influence the dependent variable, i.e. the tendency to default. This process of deriving an objective function, which maps inputs to outputs is known as modelling.
Initially, this objective function will not be able to predict precisely whether a customer will default or not. As the model encounters new instances, it learns and evolves. It improves as more and more examples become available. Ultimately, this model reaches a stage where it will be able to tell with a certain degree of precision.
hings like, which customer is going to default, and whom the bank can rely on to improve its profitability.
Machine learning aims to achieve ‘generalisability’. This means, the objective function — which maps the inputs to the output — should apply to the data, which hasn’t encountered it, yet. In the banking example, our model learns patterns from the data provided to it. The model determines which variables will influence the tendency to default. If a new customer applies for a loan, at this point, his/her variables are not yet seen by this model. The model should be relevant to this customer as well. It should predict reliably whether this customer will default or not.
If this model is unable to do this, then it will not able to generalise the unseen data. It is an iterative process. We need to create many models to see which work, and which don’t.
Data science and analysis utilise machine learning for this kind of model creation and validation. It is important to note that all the algorithms for this model creation do not come from machine learning. They can enter from various other fields. The model needs to be kept relevant at all times. If the conditions change, then the model — which we created earlier — may become irrelevant.
The model needs to be checked for its predictability at different times and needs to be modified if its predictability reduces. For the banking employee to take an instant decision the moment a customer applies for a loan, the model needs to be integrated with the bank’s IT systems. The bank’s servers should host the model. As a customer applies for a loan, his variables must be captured from a website and utilised by the model running on the server.
Then, this model should convey the decision — whether the credit can be granted or not — to the bank employee, instantly. This process comes under the domain of information technology, which is also utilised by data science.
In the end, it is all about communicating the results from the analysis. Here, the presentation and storytelling skills are required to demonstrate the effects from the study efficiently. Design-thinking helps in visualising the results, and effectively tell the story from the analysis.
The final piece of our puzzle is ‘Big Data’. How is it different from data science and machine learning?
According to IBM, we create 2.5 Quintillion (2.5 × 1018) bytes of data every day! The amount of data which companies gather is so vast that it creates a large set of challenges regarding data acquisition, storage, analysis and visualisation. The problem is not entirely regarding the quantity of data that is available, but also its variety, veracity and velocity. All these challenges necessitated a new set of methods and techniques to deal with the same.
Big data involves the four ‘V’s — Volume, Variety, Veracity, and Velocity — which differentiates it from conventional data.
The amount of data involved here is so humongous, that it requires specialised infrastructure to acquire, store and analyse it. Distributed and parallel computing methods are employed to handle this volume of data.
Data comes in various formats; structured or unstructured, etc. Structured means neatly arranged rows and columns. Unstructured means that it comes in the form of paragraphs, videos and images, etc. This kind of data also consists of a lot of information. Unstructured data requires different database systems than traditional RDBMS. Cassandra is one such database to manage unstructured data.
The presence of huge volumes of data will not lead to actionable insights. It needs to be correct for it to be meaningful. Extreme care needs to be taken to make sure that the data captured is accurate, and that the sanctity is maintained, as it increases in volume and variety.
It refers to the speed at which the data is generated. 90% of data in today’s world was created in the last two years alone. However, this velocity of information generated is bringing its own set of challenges. For some businesses, real-time analysis is crucial. Any delay will reduce the value of the data and its analysis for business. Spark is one such platform which helps analyse streaming data.
As time progresses, new ‘V’s get added to the definition of big data. But — volume, variety, veracity, and velocity — are the four essential constituents which differentiate data from big data. The algorithms which deal with big data, including machine learning algorithms, are optimised to leverage a different hardware infrastructure, that is utilised to handle big data.
To summarise, data science is an interdisciplinary field with an aim to derive actionable insights from data. Machine learning is a branch of artificial intelligence which is utilised by data science to teach the machines the ability to learn, without being explicitly
programmed. Volume, variety, veracity, and velocity are the four important constituents which differentiate big data from conventional data.