MLOps for Automated Training, Evaluation, Deployment and Monitoring — Part II

Bhagat Khemchandani
4 min read · Dec 9, 2021


Now that we understand the need for MLOps from Part I of this series, let’s dive into the architecture, high-level design, and the services involved to realize the solution.

Let us now look into:
i. An ML orchestration tool that does the heavy lifting of training, evaluating, and deploying models, followed by
ii. Helper services on top of the ML orchestration tool that help standardize the data science workflows at my organization.

i. Machine Learning Orchestration Tools

We evaluated a number of tools and frameworks used across the industry:

  • Apache Airflow
    Airflow is a generic task orchestration platform. Use Airflow if you want the most full-featured, mature tool and can dedicate time to learning how it works, setting it up, and maintaining it. It is also a good fit if you need a broad ecosystem that can run a variety of different tasks.
  • Kubeflow
    Kubeflow focuses specifically on machine learning tasks, such as experiment tracking, hyperparameter tuning, and model deployment, and offers more out-of-the-box patterns for machine learning solutions.
    Kubeflow Pipelines is a separate component of Kubeflow that focuses on model deployment and CI/CD and can be used independently of Kubeflow’s other features. Kubeflow relies on Kubernetes and is likely more interesting to you if you have already adopted it.
  • MLFlow
    Use MLFlow if you care more about tracking experiments, or tracking and deploying models using MLFlow’s predefined patterns, than about finding a tool that can adapt to your existing custom workflows.
  • AWS Sagemaker
    Amazon’s SageMaker is fully managed, ‘optimized’ for ML, and comes with lots of integrated tools such as notebook servers, Auto-ML, and monitoring within the AWS ecosystem.
    Managed and integrated does not mean easy to use, though. SageMaker pipelines look almost identical to Kubeflow’s, but their definitions require lots more detail (like everything on AWS) and do very little to simplify deployment for scientists.

While all of these tools have different focus points and different strengths, no tool can serve as a silver bullet. We might end up using a combination of these tools based on their capabilities.

As a starting point, we have selected Kubeflow as our tool of choice for the reasons stated below.

  • Kubeflow allows us to keep the entire application portable between cloud providers
  • Given we are already using Kubernetes to orchestrate workflows, Kubeflow comes out as a natural choice
  • The UI is very responsive. Status changes update in real-time. I can terminate a pipeline and see that it worked almost immediately
  • One key benefit would be the direct integration of containers and the UI. I can click on a stage in my pipeline and view the container logs
  • With Kubeflow, each pipeline step is isolated in its own container, which drastically improves the developer experience
  • Data scientists have easy access to the full compute power of the cluster from their notebooks
  • It’s also possible to run SageMaker jobs via Kubeflow components, allowing us to benefit from SageMaker’s tools while sticking to the Kubeflow framework (see the sketch after this list).
  • It’s a very flexible and extensible framework because it relies on Kubernetes to manage all code execution, resource management, and networking.
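
As a rough illustration of the SageMaker point above, the snippet below loads one of the reusable AWS SageMaker components published in the kubeflow/pipelines repository and calls it from a pipeline. This is only a hedged sketch: the component URL, its input names, and all values are assumptions to be verified against that repository, not a description of our production setup.

```python
# Hedged sketch: running a SageMaker training job from a Kubeflow pipeline via
# the reusable AWS SageMaker components in the kubeflow/pipelines repo.
# The component URL and input names below are assumptions; verify them against
# the component.yaml in that repository before relying on them.
from kfp import components, dsl

sagemaker_train_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/"
    "components/aws/sagemaker/train/component.yaml"
)


@dsl.pipeline(name="sagemaker-training-via-kubeflow")
def sagemaker_training_pipeline(region: str = "us-east-1"):
    # All values are placeholders (hypothetical bucket, IAM role, and image).
    sagemaker_train_op(
        region=region,
        image="382416733822.dkr.ecr.us-east-1.amazonaws.com/xgboost:1",
        instance_type="ml.m5.xlarge",
        instance_count=1,
        channels='[{"ChannelName": "train", "DataSource": {"S3DataSource": '
                 '{"S3Uri": "s3://my-bucket/train/", "S3DataType": "S3Prefix", '
                 '"S3DataDistributionType": "FullyReplicated"}}}]',
        model_artifact_path="s3://my-bucket/models/",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )
```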

Note: Kubeflow lets you customize everything, which always comes at the expense of simplicity.

ii. Helper Services on top of Kubeflow

Remember our missions from Part I: take the workload away from data scientists, and establish common practices with reproducibility.

Imagine if every data scientist needed to learn Kubeflow and all the details required to develop pipelines. And let's not forget that these ML orchestration tools might change from time to time or complement one another. This is a lot of work for data scientists, and none of it is related to data science at all.

Also, these orchestration tools do not prescribe what the model development pipelines should look like. If we do not standardize the use of Kubeflow pipelines and workflows, there will be a lack of consistency amongst teams in data preparation, model training, evaluation, and deployment. This will very soon become a maintenance nightmare.

Helper services to the Rescue
To resolve these two issues, we decompose the pipeline execution into two parts:

  1. Define generic yet standard pipelines in Kubeflow using the kfp SDK (a minimal sketch of such a pipeline follows this list).
  2. Design a frontend application and a model registry service that allow data scientists to trigger the above pipelines with zero knowledge of Kubeflow. Think of a service that accepts parameters such as model git repository, git SHA, dataset repository (DVC), dataset git SHA, hyperparameters, and many more, and uses them to trigger a pipeline on Kubeflow (a sketch of such a trigger call appears a little further below).
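
To make point 1 concrete, here is a minimal sketch of what such a generic pipeline could look like, assuming the kfp v1 SDK. The container images, step names, and parameters are illustrative placeholders, not our actual components.

```python
# Minimal sketch of a generic train/evaluate/deploy pipeline in kfp (v1 SDK).
# Image names and parameters are illustrative placeholders.
from kfp import dsl


@dsl.pipeline(
    name="generic-train-eval-deploy",
    description="Standard pipeline triggered by the helper service.",
)
def train_eval_deploy(
    model_repo: str,
    model_gitsha: str,
    dataset_repo: str,
    dataset_gitsha: str,
    hyperparameters: str = "{}",
):
    # Each step runs in its own container, so teams only supply images and arguments.
    train = dsl.ContainerOp(
        name="train",
        image="registry.example.com/ml/train:latest",  # hypothetical image
        arguments=[
            "--model-repo", model_repo,
            "--model-gitsha", model_gitsha,
            "--dataset-repo", dataset_repo,
            "--dataset-gitsha", dataset_gitsha,
            "--hyperparameters", hyperparameters,
        ],
    )

    evaluate = dsl.ContainerOp(
        name="evaluate",
        image="registry.example.com/ml/evaluate:latest",  # hypothetical image
        arguments=["--model-gitsha", model_gitsha],
    ).after(train)

    dsl.ContainerOp(
        name="deploy",
        image="registry.example.com/ml/deploy:latest",  # hypothetical image
        arguments=["--model-gitsha", model_gitsha],
    ).after(evaluate)
```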

This makes the model training/evaluation/deployment process as simple as ordering a pizza online. Provide a few values and submit the job.
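
Under the hood, the helper service could submit such a job roughly like this, assuming the kfp v1 client API; the host URL, repositories, SHAs, and hyperparameter values are placeholders.

```python
# Rough sketch of how the helper service could submit a run (kfp v1 client API).
# The host URL, repositories, SHAs, and hyperparameters are placeholders.
import kfp

client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")

client.create_run_from_pipeline_func(
    train_eval_deploy,  # the generic pipeline sketched above
    arguments={
        "model_repo": "git@github.com:acme/churn-model.git",
        "model_gitsha": "3f2a9c1",
        "dataset_repo": "git@github.com:acme/churn-data.git",  # tracked with DVC
        "dataset_gitsha": "8b77e02",
        "hyperparameters": '{"learning_rate": 0.01, "epochs": 20}',
    },
)
```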

At the time of writing this, a lot of machine learning teams in my organization are waiting to be onboarded onto the platform and start using it for their day-to-day experimentation, training, and deployment.

Intrigued? Please read Part III - Kubeflow Pipelines
