Machine learning life-cycle is a bunch of processes that include Data Gathering, Data Cleaning, feature engineering, feature selection, model building, hyper-parameter tuning, validation, and model deployment.
While gathering data can take many forms such as manual surveys, data entry, web scrapping, or the data generated during an experiment, data cleaning is where the data is transformed into a standard form that can be used during other stages of the life-cycle.
The recent surge of machine learning has also welcomed a lot of businesses to adopt an AI-based solution for their mainstream products and therefore, a new chapter of AutoML has arrived in the market. It can be a great tool to quickly setup AI-based solutions, but there are still some concerning factors that need to be addressed.
Table of Contents
What is AutoML?
It is that set of tools that automate some parts of machine learning which is itself an automated process of generating predictions and classifications leading to actionable results. Though it can only automate feature engineering, model building, and sometimes deployment stages, most of the AutoML tools support multiple machine learning algorithms and almost as many evaluation metrics.
When such kind of tool is started, it runs the same dataset over all the algorithms, tests various metrics associated with the problem, and then presents a detailed report card. Let’s explore some famous tools available in the marketplace and are used extensively.
One of the leading solutions in AutoML is H2O.ai that offers industry-ready solutions to business problems coding nothing from scratch. This allows anyone from any domain to extract meaningful insights from the data without the need of having expertise in machine learning.
The H2O is an open-source that supports all widely used machine learning models and statistical approaches. It is built to deliver supper fast solutions as the data is distributed across clusters and then stored in a columnar format in memory, allowing parallel read operations.
Newer versions of this project also have GPU support, which makes it more fast and efficient. Let’s look at how this can be performed using Python (run the code in jupyter notebook for better understanding):
!pip install h2o # run this if you haven’t installed it
from h2o.automl import H2OAutoML
df = h2o.import_file() # Here provide the file path
y = ‘target_label’
x = df.remove(y)
X_train, X_test, X_validate = df.split_frame(ratios=[.7, .15])
model_obj = H2OAutoML(max_models = 10, seed = 10, verbosity=”info”, nfolds=0)
model_obj.train(x = x, y = y, training_frame = X_train, validation_frame=X_validate)
results = model_obj.leaderboard
This will store the results of all algorithms displaying their respective metrics depending upon the problem.
Read: Machine Learning Tools
This is fairly a new library launched this year, which supports a wide range of AutoML features with just a few lines of code. Be it processing missing values, transforming categorical data to model feedable format, hyper-parameter tuning, or even feature engineering, PyCaret automates all of this behind the scenes when you can focus more on data manipulation strategies.
It is more of a Python wrapper for all available machine learning tools and libraries such as NumPy, pandas, sklearn, XGBoost, etc. Let’s understand how you can perform classification problem using Pycaret:
!pip install pycaret # run this if you haven’t installed it
from pycaret.datasets import get_data
from pycaret.classification import *
df = get_data(‘diabetes’)
setting = setup(diabetes, target = ‘Class variable’)
compare_models() # This function simply displays the comparison of all algorithms!
selected_model = create_model() # pass the name of algorithm you want to create
final_model = finalize_model(selected_model)
save_model(final_model , ‘file_name’)
loaded = load_model(‘file_name’)
That’s it, you just created a transformation pipeline that performed the feature engineering, trained a model, and saved it!
We have looked upon two libraries that automate selecting features, model building, and tuning it to get the best results, but we haven’t discussed how the data cleaning can be automated. This process can be automated for sure, but it requires manual verification about whether the right data is passed or if the values make any sense or not.
More data is a plus point to the model building, but it should be quality data to get quality results. Google DataPrep is an intelligent data preparation tool offered as a platform as a service that allows visual data cleaning of the data, meaning you can change the data without coding even a single line and just selecting the options.
It offers an interactive GUI, which makes it super easy to select options to perform the functions you want to apply. The best part about this tool is that it will display all the changes that are done on the dataset in a side panel in the order they have been performed and any step can be changed. It helps in keeping a track of the changes. You will be prompted with suggestions to be made, which are mostly correct.
The resulting file can be exported to local storage or as this service is provided in Google Cloud Platform, you can directly take this file to any Google Storage bucket or BigQuery tables where you can perform machine learning tasks directly in the query editor. The major setback to this can be its recurring costs, it is not an open-source project and rather a full-fledged industry solution.
Can this replace Data Scientists?
Absolutely not! The AutoML is great and it can help the Data Scientist to speed up a particular life cycle, but expert advice is always needed. For instance, it will take much time to get the right model for a particular problem statement from an AutoML which runs all the algorithms than from an expert who will run it on specific algorithms that best suit the problem.
Data scientists will be required to validate the results from these types of automation and then provide a feasible solution to the businesses. The domain expert people will find this automation very useful as they might not have much experience in deriving insights from the data, but these tools will guide them in the best way.
If you want to master machine learning and learn how to train an agent to play tic tac toe, to train a chatbot, etc. check out upGrad’s Machine Learning & Artificial Intelligence PG Diploma course.