From Jr. Data Scientist/Machine Learning Engineer to Full-stack Data Scientist/Machine Learning Engineer
The outlook in the field of Data Science has changed significantly compared with even two or three years ago. Learning should never stop, so to thrive one must develop the right skill set to meet current industry expectations.
“Adaptability is about the powerful difference between adapting to cope and adapting to win.” — Max McKeown.
Let us look at the key elements that can help us move from a Jr. Data Scientist/Machine Learning Engineer to a Full-stack Data Scientist/Machine Learning Engineer.
The Past Expectation
To adapt to the industry's current expectations, it is vital to understand the past ones. In a nutshell, this is what the landscape around a Data Scientist's day-to-day role generally looked like:
- The AI space was still relatively new (though not in academia), and many companies and startups were still analyzing its applications and valid use cases.
- Research was the primary focus. The caveat was that this research was often not directly in line with the core business of the organization, so initially it was not expected to carry much credibility.
- Companies generally blended the role of a Data Scientist with that of a Data Analyst or Data Engineer, again because the enterprise application of AI was still vague.
- Individuals faced a similar dilemma: much of their research or work was not directly in line with the business and was not practically viable to serve as a product.
The Current Outlook
The democratization of AI has brought remarkable developments from companies and startups. Let us try to understand them:
- The industry now distinguishes between the roles of Data Scientist, Machine Learning Engineer, Data Analyst, Data Engineer, and even MLOps Engineer.
- Businesses no longer allow research in the wild, because they know exactly which use case they are tapping into. A similarly clear and focused approach is required from the individual.
- Every piece of research or proof of concept (POC) must lead to a tangible and servable product.
A thorough dissection of the roles
If we have to pick one area where businesses have excelled in the AI space, it is undoubtedly in setting clear expectations for each of the roles. In a nutshell:
1. Data Scientist: A person (generally from a statistics or mathematics background) who uses a variety of means, including AI, to extract valuable information from data.
- The fundamental difference between a Data Analyst and a Data Scientist is that the former generally relies on domain knowledge and manual, old-school methods to make sense of data on a small to medium scale, whereas the latter is responsible for collecting, analyzing, and interpreting data on a larger scale using a wider range of tools such as AI, SQL, and the same old-school manual methods.
- Domain knowledge is not a must, but it is helpful.
- The primary job is to maintain data and extract insights from it that contribute to the business, not to develop software or a product.
- A statistician or a mathematician can become a good Data Scientist.
2. Machine Learning Engineer: A specialized software engineer who develops a product or service based on AI.
- An ML engineer needs all the expertise of traditional software engineering along with knowledge of AI, because he or she will eventually build software with AI at its heart.
- The primary job is not to extract information from data but to develop an AI tool that can perform that job.
- A developer with good knowledge of machine learning/deep learning as well as software engineering can become a good Machine Learning Engineer.
3. Machine Learning Operations (MLOps) Engineer: A specialized software engineer who maintains and automates the pipeline used by the ML system.
- A relatively new field inspired by DevOps, though different from traditional DevOps roles.
- Unlike traditional software engineering, development of an AI-based product, software, or service does not stop once the software is built. The data the system sees in production changes over time, a phenomenon known as data drift, so the model has to be updated regularly with new data.
- The primary job includes all traditional DevOps work as well as maintaining and automating the pipeline and handling data drift.
- A developer with good knowledge of machine learning/deep learning, software engineering, and cloud technologies can become a good MLOps engineer.
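The data-drift monitoring mentioned for the MLOps role can be illustrated with a small sketch. It shows only one possible approach, not something the roles above prescribe: it computes the Population Stability Index (PSI) between a reference sample and a recent sample of a single numeric feature. The function name `population_stability_index`, the bin count, the synthetic data, and the 0.2 alert threshold are all assumptions chosen for illustration (0.2 is a commonly quoted rule of thumb, not a standard).

```python
import math
import random


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature (hypothetical helper)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Replace empty bins with a tiny value so the logarithm is defined.
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data for illustration only: training data, fresh data from the
# same distribution, and fresh data whose mean has shifted.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold
psi_same = population_stability_index(training, same_dist)
psi_shifted = population_stability_index(training, shifted)
print(f"no shift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
print("retrain needed:", psi_shifted > DRIFT_THRESHOLD)
```

In a real pipeline a check like this would run on a schedule against production inputs, and crossing the threshold would trigger retraining rather than a print statement.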
Anyone new to the field or aiming to advance in his or her career must understand all these roles and expectations well. Given that companies now clearly distinguish these roles, individuals are expected to do the same. A vague understanding of them will not help.
The stack of a Full stack Machine Learning system
Let us now move to the essential point. To become a Full stack Machine Learning Engineer, understanding the concept behind the stack is necessary.
What is Full stack?
- As in traditional software engineering, developing an AI-based system needs a suite of tools. This complete suite can be referred to as the full stack.
- The full stack is typically built from three building blocks: cloud technology, governance technology, and AI technology.
- Building an AI system involves multiple components across these three building blocks: configuration; data collection, transformation, and verification; ML code (training and validation); resource (process and machine) management tools; serving infrastructure; and monitoring (which can be combined with data-drift handling). This list is not exhaustive, but it is generic and can be modified as needed.
- So, to arrive at a well-performing ML system, we have to use a stack of tools that covers all the components mentioned above, sometimes with more than one tool for a single component.
What is the importance of the ability to design a Full stack system?
- As mentioned above, today’s businesses do not allow research or a POC without a tangible, sustainable product.
- It is no exaggeration to say that model training is not the most important part; in fact, I would rank it third or even fourth. The person who can design and maintain the stack becomes vital for the company, because:
- If the same person who trains a model also maintains (or contributes to) the data pipeline, he or she can design it to cater to the model's exact needs.
- Understanding the deployment infrastructure helps in building a more performance-centric system.
- Understanding the serving infrastructure helps with speed and latency (generally the loudest complaint about any ML system).
- Understanding monitoring helps with data drift and with long-run model performance.
- So, an individual who knows all of this can make the whole pipeline more efficient and improve its performance. Above all, it saves cost for the company, since a single person can handle multiple roles, which in turn increases the individual's value to the company.
So, to summarize, it is essential not to be obsessed with model accuracy alone but with all the key performance metrics: speed, latency, accuracy, infrastructure needs, serving requests, and so on.
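As a rough illustration of tracking speed and latency alongside accuracy, the sketch below times repeated calls to a prediction function and reports median and tail latency plus sequential throughput. Everything named here is an assumption for the example: `fake_predict` is a stand-in for a real model call, and `measure_latency_ms` and `summarize` are hypothetical helpers, not part of any serving framework.

```python
import statistics
import time


def fake_predict(features):
    """Stand-in for a real model call (hypothetical; replace with your model)."""
    return sum(features) > 1.0


def measure_latency_ms(predict, payload, n_requests=200):
    """Call `predict` repeatedly and return per-request latencies in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Report median latency, tail latency, and sequential throughput."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    total_seconds = sum(latencies) / 1000.0
    throughput = len(latencies) / total_seconds if total_seconds > 0 else float("inf")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "requests_per_second": throughput,
    }


report = summarize(measure_latency_ms(fake_predict, [0.2, 0.5, 0.9]))
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Reporting p95 and p99 next to the median matters because a serving system that is fast on average can still be slow for a noticeable share of requests.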
Overview of how a full stack system works
Figure: Ideal ML system lifecycle overview (picture credit: Microsoft MLOps)
An ideal ML pipeline must follow the concepts below:
- Versioning of Project code
- Versioning of Data
- Versioning of Model
- Universal artifact store to store versioned assets
- Generic pipeline blueprint:
- Common discovery + experimentation policy
- Experiment tracking (for example metrics, results, and performance)
- A common strategy to interconnect components of the pipeline
- Publish results
- A mechanism to easily reproduce, recreate, port
- Support for CI/CD
- Sufficient infra to support development as well as production
- Easy adaptation for production and endpoints
- Scalable serving infrastructure to cater to ever-increasing requests
- A one-time setting configuration with the stack
- Version Dataset with DVC.
- Start tracking experiments with MLflow/Wandb.
- Log results, metrics, etc., with MLflow/Wandb to the universal artifact store (with Azure Blob Storage as the backend).
- Log the model (or any related assets) as versioned assets with MLflow/Wandb to the universal artifact store.
- Package individual components with Docker.
- Store the packaged components in the desired Docker repository.
- Do the packaging and publishing through CI/CD.
- Schedule automated model training based on continuous monitoring for data drift.
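The idea behind the versioning and experiment-tracking steps can be sketched without any particular tool. The example below is a standard-library illustration only and does not use the DVC, MLflow, or Wandb APIs: it stores each artifact under its content hash, so identical content always gets the same version id, and writes a run record that ties metrics to exact data and model versions. The names `store_artifact` and `log_run`, the directory layout, and all values (including the accuracy figure) are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_artifact(store: Path, payload: bytes) -> str:
    """Save bytes under their SHA-256 digest and return the digest as version id."""
    version = hashlib.sha256(payload).hexdigest()
    target = store / "artifacts" / version
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(payload)
    return version


def log_run(store: Path, run_name: str, params: dict, metrics: dict,
            data_version: str, model_version: str) -> Path:
    """Write one experiment record linking results to data and model versions."""
    record = {
        "run_name": run_name,
        "params": params,
        "metrics": metrics,
        "data_version": data_version,
        "model_version": model_version,
    }
    run_file = store / "runs" / f"{run_name}.json"
    run_file.parent.mkdir(parents=True, exist_ok=True)
    run_file.write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_file


# A temporary directory stands in for the universal artifact store.
store = Path(tempfile.mkdtemp(prefix="artifact_store_"))
data_v1 = store_artifact(store, b"age,income\n34,52000\n")
data_v2 = store_artifact(store, b"age,income\n34,52000\n41,61000\n")
model_v1 = store_artifact(store, b"serialized-model-bytes")  # placeholder bytes

# Illustrative parameter and metric values, not real results.
run_path = log_run(store, "run-001", {"learning_rate": 0.1},
                   {"accuracy": 0.91}, data_v2, model_v1)
print("data version:", json.loads(run_path.read_text())["data_version"][:12])
```

Because every run record names the exact data and model versions it used, a past result can be reproduced by fetching those two artifacts, which is the property the dedicated tools provide at scale.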
To remain a relevant, resourceful, key team player, it is necessary to keep broadening our knowledge. It will unquestionably help one progress in any competitive environment.