Machine Learning is gaining momentum and has been a topic of discussion for a long time. Powerful algorithms and architectures in this domain have made it possible to apply Machine Learning in practical, real-world settings.
It is no longer just a research topic and has spread deep into useful application areas. Today, more than ever, there is a need to master the end-to-end pipeline for Machine Learning projects.
Interest in Machine Learning is growing, and an immense amount of resources is available to help you understand the fundamentals of ML and AI. Many courses take you from basic concepts all the way to building state-of-the-art models.
But is that it? Do we really learn how to access the data and do we really see how to clean the data so that our ML model can extract useful features from it? And what about the deployment part? There are so many questions on similar lines that remain unanswered in our minds after we complete such courses and curriculums.
This problem arises from a poor understanding of the complete end-to-end Machine Learning pipeline. In this article, we will go through one such pipeline to understand what needs to be done to get good results in a real-life ML project.
One of the books that demonstrates this best is Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
This end-to-end pipeline can be divided into a few steps for better understanding:
- Understanding the problem statement
- Acquiring the required data
- Understanding the data
- Cleaning the data
- Selecting the best model for training
- Fine-tuning the hyperparameters
- Presenting the results
- Deploying and maintaining the system
To better understand the pipeline of any real-life Machine Learning project, we will use the popular example of the California House price prediction problem. We will discuss all the above points in relation to this problem statement. There might be some minor changes for different projects but overall the objective remains the same.
Understanding the problem statement
In order to build a good solution, one needs to understand the problem statement very clearly. You will most probably end up building and training a Machine Learning model but real-life application areas need much more than just the models. The model’s output should be matched with what exactly is needed by the end-user.
For this particular example, we are given a dataset of California housing metrics such as population, income, and house prices. The model should predict the price of a house given its other attributes, such as location, population, and income.
This step matters because it pins down exactly what needs to be done and what kind of solution is needed. It is where most of the brainstorming about how to approach the problem takes place.
Acquiring the required data
Once you have understood the problem statement clearly and have decided to move forward with a Machine Learning approach, you should start searching for relevant data. Data is the most important ingredient of any Machine Learning project, so you must carefully find and select only good-quality data. The final performance of the ML models depends on the data used for training.
There are various sources of data that reflect real-life distributions. For our example, we can take the California House Price Prediction dataset from Kaggle. The data is in CSV format, so we will use the Pandas library to load it.
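As a minimal sketch of the loading step, the snippet below reads CSV data with Pandas and takes a first look at it. The in-memory sample rows and the column names are illustrative assumptions standing in for the downloaded file; with the real dataset you would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Illustrative stand-in for the downloaded CSV file (a few example rows).
# With the real file you would call pd.read_csv("<path to the CSV>") instead.
sample_csv = io.StringIO(
    "longitude,latitude,median_income,median_house_value,ocean_proximity\n"
    "-122.23,37.88,8.3252,452600.0,NEAR BAY\n"
    "-122.22,37.86,8.3014,358500.0,NEAR BAY\n"
    "-118.30,34.26,1.7083,94600.0,INLAND\n"
)

housing = pd.read_csv(sample_csv)

# Quick first look: top rows, column types, and summary statistics.
print(housing.head())
housing.info()
print(housing.describe())
```

Calling head(), info(), and describe() right after loading is a cheap way to spot missing values and unexpected column types early.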
Understanding the data
Understanding the data you are working with is a very important part of an ML solution. It enables us to choose which algorithms or model architectures are better suited for the project. Before looking at the data in detail, it is good practice to first split the dataset into train and test sets. Keeping the test set untouched reduces the chance of unconsciously tailoring your choices to it, which is known as data snooping bias.
There are various ways of splitting a dataset into train and test sets. The simplest is to split by a fixed percentage. Common choices are 80% train and 20% test, or 90% train and 10% test when the dataset is large.
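The fixed-percentage split described above can be sketched with Scikit-Learn's train_test_split. The synthetic DataFrame and the 20% test share are assumptions for illustration; set test_size to whatever ratio you settle on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows with two numeric columns.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=100),
    "median_house_value": rng.uniform(50_000, 500_000, size=100),
})

# Hold out a fixed share of the rows as the test set.
# random_state makes the split reproducible between runs.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

print(len(train_set), "train rows,", len(test_set), "test rows")
```

Fixing random_state matters here: without it, every run would produce a different test set, and over time you would end up seeing the whole dataset.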
After the split, you should visualize the train set in depth to understand the data. The dataset includes latitude and longitude values, so a scatter plot of the locations is a helpful way to look at the density of the data.
Finding the correlation between attributes helps you see which ones relate most strongly to the target. In this case, we want to know which attributes relate most to the house prices in the dataset. This is easy to do with the corr() method of a Pandas DataFrame, which returns the correlation coefficient of each numeric attribute with every other one. To see the relations with respect to house prices, look at the house-value column of the resulting correlation matrix.
For this dataset, median_income is positively correlated with the house value, while latitude is negatively correlated with it.
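The correlation check can be sketched as follows. The data here is synthetic and deliberately constructed so that the target rises with income and falls with latitude; it is an assumption for illustration, not the real dataset, so the printed coefficients will not match the real ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption), built so that house value
# increases with income and decreases with latitude.
rng = np.random.default_rng(0)
n = 1000
median_income = rng.uniform(0.5, 15.0, size=n)
latitude = rng.uniform(32.5, 42.0, size=n)
noise = rng.normal(0.0, 20_000.0, size=n)
median_house_value = (
    40_000 * median_income - 20_000 * latitude + 900_000 + noise
)

train_set = pd.DataFrame({
    "median_income": median_income,
    "latitude": latitude,
    "median_house_value": median_house_value,
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr_matrix = train_set.select_dtypes(include="number").corr()

# Correlation of every attribute with the target, strongest first.
target_corr = corr_matrix["median_house_value"].sort_values(ascending=False)
print(target_corr)
```

Note that the correlation coefficient only captures linear relationships, so an attribute with a low value here can still be useful to a non-linear model.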
Finally, you can also try some feature engineering by combining attributes. For example, rooms_per_household can be much more informative than the total_rooms or households values individually.
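A combined attribute like this is a one-line column operation in Pandas. The three example rows below are illustrative assumptions, and the column names total_rooms and households are assumed to match the dataset.

```python
import pandas as pd

# Illustrative rows (assumption): total rooms and households per district.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "households": [126.0, 1138.0, 177.0],
})

# New feature: average number of rooms per household in each district.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]

print(housing)
```

After adding a feature like this, it is worth recomputing the correlations to check whether the new attribute relates to the target more strongly than its parts.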
Cleaning the data
In this step, you prepare the data for the Machine Learning project. It is the most time-consuming and important step of the entire pipeline. The performance of the model depends heavily on how well you prepare the data. It is good practice to write functions for this purpose, because you can reuse them whenever needed, including in production to prepare new data for predictions.
One of the most common problems in real data is missing values for some entries. There are a few ways to handle them. You can delete the entire attribute, but this throws away information the model could use. You can drop the rows that contain a missing value. The most widely used option is to replace the missing value with another value, such as zero or the arithmetic mean of the column if it is numeric.
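The three options for missing values can be sketched as below. The small DataFrame is an illustrative assumption; SimpleImputer is Scikit-Learn's class for filling missing values, used here with the mean strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative numeric data (assumption) with two missing entries.
housing_num = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "population": [322.0, 2401.0, 496.0, np.nan],
})

# Option 1: drop every row that contains a missing value.
dropped_rows = housing_num.dropna()

# Option 2: drop the whole attribute that has missing values.
dropped_column = housing_num.drop(columns=["total_bedrooms"])

# Option 3: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(housing_num),
    columns=housing_num.columns,
)

print(filled)
```

Fitting the imputer on the train set and reusing it on the test set and on live data keeps the fill values consistent across all three.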
For categorical values, it is better to represent them as numbers using one-hot encoding so that the model can work with them more easily. Scikit-Learn provides the OneHotEncoder class, which converts categorical values into one-hot vectors.
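Here is a minimal sketch of one-hot encoding with OneHotEncoder. The ocean_proximity column and its category labels are assumptions used for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical column (assumption).
housing_cat = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR OCEAN"],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(housing_cat)  # returns a sparse matrix

# Categories are stored in sorted order, one output column per category.
print(encoder.categories_)
print(one_hot.toarray())
```

The result is sparse by default, which saves memory when an attribute has many categories; toarray() is only used here to make the output readable.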
Another thing to take care of is feature scaling. Some attributes may have value ranges that differ drastically from others. It is better to bring them to a common scale so that the model can work with those values and perform better.
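One common way to do this is standardization, sketched below with Scikit-Learn's StandardScaler. The sample values are illustrative assumptions; min-max scaling with MinMaxScaler is an equally valid alternative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative features (assumption) on very different scales:
# column 0 is an income-like value, column 1 a population-like value.
X = np.array([
    [1.5, 300.0],
    [8.3, 2400.0],
    [3.2, 1200.0],
    [5.0, 500.0],
])

# Standardization: subtract each column's mean and divide by its
# standard deviation, so every column ends up with mean 0 and variance 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

As with imputation, the scaler should be fitted on the train set only and then applied unchanged to the test set and to new data.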
Selecting the best model for training
After completing all the data cleaning and feature engineering, the next step becomes quite easy. Now, all you have to do is train some promising models on the data and find out the model that gives the best predictions. There are a few ways that help us select the best model.
Our example of the California house price prediction is a regression problem. This means that we have to predict a value from a range of numbers which is, in this case, the house price.
The first step here is to train a few models and evaluate them on the validation set. You should not use the test set here, as that leads to overfitting on the test set and a model that generalizes poorly to new data. From those models, the one with low error on both the training and validation sets should be chosen most of the time. The choice may also depend on the use case, as some tasks require different configurations than others.
As we have already cleaned the data and the preprocessing functions are ready, it is very easy to train different models in three to four lines of code using frameworks like Scikit-Learn or Keras. Scikit-Learn also offers cross-validation, which gives a more reliable estimate of model performance and helps when comparing hyperparameter settings for models like decision trees.
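As a rough sketch of comparing candidate models with cross-validation, the snippet below scores two regressors with cross_val_score. The synthetic data, the two models, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption): the target is a linear
# function of the first two features plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
}

rmse_by_model = {}
for name, model in models.items():
    # Scikit-Learn scorers follow "higher is better", so the MSE is
    # returned negated; flip the sign before taking the square root.
    scores = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=5
    )
    rmse_by_model[name] = float(np.sqrt(-scores).mean())
    print(f"{name}: mean RMSE over 5 folds = {rmse_by_model[name]:.3f}")
```

Because every fold serves as validation data once, this gives a steadier comparison than a single train/validation split, at the cost of training each model several times.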
Fine-tuning the hyperparameters
After shortlisting a few models, you need to fine-tune their hyperparameters to get the most out of them. There are several ways to do this. One is to change the hyperparameters manually and retrain the models again and again until you get a satisfactory result. The problem is clear: you cannot possibly try as many combinations as an automated search would. This is where automated methods come in.
Grid Search is provided by Scikit-Learn as the GridSearchCV class. It runs the cross-validation on its own and finds the best-performing combination among the hyperparameter values you ask it to try. All we have to do is specify which hyperparameters and values it should experiment with. It is a simple but very powerful feature.
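A minimal sketch of a grid search is shown below. The random forest model, the synthetic data, and the specific grid values are assumptions chosen to keep the example fast; they are not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(150, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=150)

# Every combination of these values is tried: 2 x 2 = 4 candidates.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,  # 3-fold cross-validation for each candidate
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best CV RMSE:", float(np.sqrt(-grid_search.best_score_)))
```

After fitting, grid_search.best_estimator_ holds a model retrained on the full data with the best combination, ready to use for predictions.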
Randomized search is another approach that can be used for a similar purpose. Grid Search works well when the space of hyperparameters to explore is small, but when the search space is large, it is better to use RandomizedSearchCV. It evaluates a fixed number of random hyperparameter combinations and reports the best one it has seen.
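The same idea with RandomizedSearchCV can be sketched as follows. The model, the synthetic data, the sampling ranges, and the budget of 5 iterations are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (assumption) with 3 features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(150, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=150)

# Distributions to sample from instead of fixed lists of values.
# randint(low, high) draws integers from low up to high - 1.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_features": randint(1, 4),  # 1 to 3, since X has 3 features
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,  # number of random combinations to evaluate
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
```

The n_iter parameter gives direct control over the computing budget, which is the main practical advantage over an exhaustive grid.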
Last but not least is the Ensemble Learning approach. Here several models each make their own prediction, and the final prediction is taken as the average of all of them. This is a very promising method and has won many competitions on Kaggle.
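Averaging the predictions of several models can be sketched with Scikit-Learn's VotingRegressor, which does exactly this for regression. The three base models and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(150, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=150)

# Three different base models; the ensemble averages their predictions.
ensemble = VotingRegressor([
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=42)),
])
ensemble.fit(X, y)

X_sample = X[:5]
ensemble_predictions = ensemble.predict(X_sample)

# The same result computed by hand from the fitted base models.
individual_predictions = np.array(
    [model.predict(X_sample) for model in ensemble.estimators_]
)
manual_average = individual_predictions.mean(axis=0)

print(ensemble_predictions)
print(manual_average)
```

Ensembles tend to help most when the base models make different kinds of errors, which is why the sketch mixes a linear model with tree-based ones.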
After fine-tuning the hyperparameters of the final model, you can use it to make predictions on the test set and evaluate how well it performs on data it has never seen. Remember that you should not keep tuning your model after this to improve the test-set score, as that leads to overfitting on the test set samples.
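The final evaluation can be sketched as a single pass over the held-out test set. The synthetic data, the random forest, and the use of RMSE as the metric are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

# The test set is split off first and only used once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for the final, already tuned model.
final_model = RandomForestRegressor(n_estimators=30, random_state=42)
final_model.fit(X_train, y_train)

# Single evaluation on the held-out test set.
test_predictions = final_model.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, test_predictions)))

print(f"Test RMSE: {test_rmse:.3f}")
```

It is normal for this number to be slightly worse than the cross-validation score, since the hyperparameters were tuned on the validation folds.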
Presenting the results
Once the best model is selected and the evaluation is done, there is a need to properly display the results. Visualization is the key to making better Machine Learning projects as it is all about data and understanding the patterns behind it. The raw numeric results can sound good to people already familiar with this domain but it is very important to visualize it on graphs and charts as it makes the project appealing and everyone can get a clear picture of what actually is happening in our solution.
Deploying and maintaining the system
Most of the learners reach this stage of the pipeline and face tremendous issues while trying to deploy the project for application in a real-life scenario. It is quite easy to build and train models in a Jupyter Notebook but the important part is to successfully save the model and then use it in a live environment.
One of the most common problems ML engineers face is that the data received live differs from the data the model was trained on. Here we can reuse the preprocessing functions we built while creating the training pipeline, so that live data goes through exactly the same transformations.
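One way to keep preprocessing and model together is to bundle them in a Scikit-Learn Pipeline and save the whole object, sketched below with joblib. The pipeline steps, the synthetic data, and the file name housing_model.joblib are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption).
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(100, 2))
y_train = 4.0 * X_train[:, 0] + X_train[:, 1] + rng.normal(0, 0.5, size=100)

# Preprocessing and model in one object, so live data always goes
# through the same steps that were used during training.
full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
full_pipeline.fit(X_train, y_train)

# Save the fitted pipeline to disk and load it back, as a serving
# process would do at startup (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "housing_model.joblib")
    joblib.dump(full_pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# "Live" rows, one with a missing value that the imputer fills in.
X_live = np.array([[5.0, np.nan], [2.5, 7.0]])
live_predictions = loaded_pipeline.predict(X_live)

print(live_predictions)
```

Saving the pipeline rather than the bare model means the serving code never has to reimplement imputation or scaling, which removes a common source of training/serving mismatch.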
There are two types of Machine Learning models that can be deployed: online models and offline models. An online model keeps learning from the data it receives in real time. An offline model does not learn from new samples and has to be retrained and updated if the kind of data it receives changes. Both types therefore need proper maintenance.
When deploying Machine Learning models, they need to be wrapped in a platform that makes them easy for users to interact with. The options are wide: a web app, an Android app, a RESTful API, and many more. Basic knowledge of building such apps or APIs is a huge plus. You should be able to deploy NodeJS or Python apps on cloud services like Google Cloud Platform, Amazon Web Services, or Microsoft Azure.
If you are not comfortable with frameworks like Django or Flask, you can try Streamlit, which lets you turn Python code into a web app with just a few lines of additional code. There are various such libraries and frameworks to explore.
To conclude, Machine Learning projects differ from traditional projects in terms of their pipeline, and if you manage to master this pipeline, everything else becomes much easier.
The steps of this end-to-end pipeline that beginners most often neglect are data cleaning and model deployment. If these steps are taken care of, the rest is much like any other project.
Following these steps and having a pipeline set for projects helps you have a clear vision about the tasks, and debugging the issues becomes more manageable. So I suggest that you go through these steps and try implementing an end to end Machine Learning project of your own using this checklist. Pick up a problem statement, find the dataset, and move on to have fun on your project!