MLOps for Automated Training, Evaluation, Deployment and Monitoring
This is Part I of a series illustrating how I automated the training, evaluation, and deployment of machine learning models for one of my clients and built a fully functional framework around Kubeflow and MLOps.
- Part I - Takes you through the problem statement, the need for MLOps, and a 10,000-foot view of the project goal and the model lifecycle
- Part II - Focuses on the architecture, the high-level design, and the services involved in realizing the solution
- Part III - Covers Kubeflow Pipelines: the training and validation pipeline, evaluation of pre-trained models, and deployment of a trained model
- Part IV - An orchestrator for Kubeflow Pipelines, and authentication (SSO) setup with Kubeflow, Istio, Dex, and Cognito
- Part V - Dataset versioning using DVC, and using EFS to share datasets across stages
- Part VI - Automated model deployment using KServe/Helm charts for serving models and packaging models on the fly
- Part VII - Carbon footprint/emissions monitoring with Kubernetes and New Relic, and using MLflow to track model training parameters and metrics
Problem Statement
Machine Learning (ML) models built by data scientists represent only a small fraction of the components that make up an enterprise production deployment workflow. To operationalize ML models, data scientists must work closely with multiple other teams, such as business, engineering, and operations. This creates organizational challenges in communication, collaboration, and coordination.
MLOps
MLOps, or machine learning operations, is the set of infrastructure, methods, and practices used to streamline the machine learning life cycle, from model design and data preparation to the monitoring of models in production.
MLOps borrows many principles from DevOps. Both encourage and facilitate collaboration between the people who develop (software engineers and data scientists), the people who manage infrastructure, and other stakeholders. Both emphasize process automation and continuous development so that speed and efficiency are maximized.
Unlike traditional software, a machine learning system is not influenced only by changes to its code. Data is the other critical input that needs to be managed, as are parameters, metadata, logs, and, finally, the model itself.
In addition to solving an organizational challenge, MLOps addresses a technical one: how to make the process reproducible and auditable. By using similar tools and following common workflows, teams can more easily automate the delivery of an end-to-end ML system while ensuring its reproducibility. This, in turn, increases trust in the model's predictions and facilitates auditing.
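To make this concrete, here is a minimal, framework-free sketch of what "more than code" versioning means in practice: a training run is recorded together with its code version, data version, parameters, and metrics, so it can be reproduced and audited later. All names and fields below are illustrative, not part of any specific tool (MLflow and DVC, covered later in the series, do this properly).

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Content hash used as a version identifier for code or data."""
    return hashlib.sha256(content).hexdigest()[:12]


def make_run_record(code: bytes, data: bytes, params: dict, metrics: dict) -> dict:
    """Bundle everything needed to reproduce and audit a training run.

    Unlike traditional software, a model is a product of code AND data
    AND parameters -- changing any one of them yields a different model.
    """
    return {
        "code_version": fingerprint(code),
        "data_version": fingerprint(data),
        "params": params,
        "metrics": metrics,
    }


record = make_run_record(
    code=b"def train(): ...",
    data=b"author,title\nSmith,Paper A\n",
    params={"lr": 0.001, "epochs": 10},
    metrics={"f1": 0.91},
)
print(json.dumps(record, indent=2))
```

If any of the four inputs changes, the record no longer matches, which is exactly the audit signal a tracking system provides.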
I don’t intend to describe MLOps in depth; there are plenty of resources on the web. Some references:
- Continuous Delivery for Machine Learning (https://martinfowler.com/articles/cd4ml.html)
- Introduction to MLOps by O’Reilly Media
You can also take an online specialization course: Machine Learning Engineering for Production (MLOps) Specialization
Project Mission
At the risk of oversimplification, the mission of my project was to:
- Relieve data scientists of the work of operationalizing machine learning (automated model training, deployment, etc.)
- Establish common practices, with reproducible model training and robustness by design
- Automate most of the infrastructure and builds required during the model development lifecycle
Use case Overview
As with any project, data science teams work to accomplish a task, e.g., extracting authors from research papers, or finding references and citations in manuscripts. For any task, multiple model architectures must be evaluated, and the model best suited to the objectives and metrics is then deployed.
The figure above illustrates the lifecycle of model development, deployment, and maintenance. It is the responsibility of MLOps to provide the tools, pipelines, and underlying infrastructure that make this end-to-end process seamless.
If this has you intrigued, read on to Part II, which focuses on the architecture, the high-level design, and the services involved in realizing the solution.