MLOps — Dataset versioning using DVC and EFS in Kubeflow Pipelines
Background
Machine Learning (ML) models often require large datasets of text documents or images that are used throughout the ML model development cycle: training, validation, and testing. Like source code, these datasets, together with the information on how they are used during development, need to be tracked and versioned to ensure reproducibility of the ML models and consistency of ML workflows across teams.
A common practice is to store the dataset on AWS S3/HDFS, sometimes with sub-folders for different versions.
This is not an effective approach: the data is not immutable, and the convention is not consistent enough to be shared across different teams.
Instead, datasets are better tracked if versioned in the same way as source code, where a version unambiguously refers to the same data and one can always access the particular version of data that was used to develop an ML model, therefore allowing it to be reproduced.
This is where DVC comes in, enabling sharing and version control of large datasets and model files used by data science teams.
DVC in a nutshell
DVC (Data Version Control) is an open-source application for machine learning data and model version control. Think Git for data: the DVC syntax and workflow patterns are very similar to Git, making it intuitive to incorporate into existing repositories. Its features go beyond data and model versioning and include pipeline support and experiment tracking.
Let us now dive into how DVC is used in Kubeflow pipelines. Read more about DVC here.
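Because a DVC-tracked dataset is pinned to a Git revision, the same revision always resolves to the same data. As a minimal sketch of this idea using DVC's Python API (the repository URL, file path, and tag below are placeholders, not this project's actual values):
import dvc.api

# Read a file exactly as it existed at Git tag "v1.0" of a dataset repository.
# Repository URL, file path, and tag are illustrative placeholders.
with dvc.api.open(
    "data/sample.txt",
    repo="https://github.com/example-org/example-dataset-repo",
    rev="v1.0",
) as f:
    content = f.read()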
Repository structure
In order to use the MLOps platform in a consistent fashion, each repository must contain:
- data (contains all data points versioned using DVC; this folder is not present when you clone the repository but is created when you pull the data)
- data.dvc (DVC metadata file describing the data folder)
- data-split (contains reference lists of subsets of the data, e.g. the training set for the 6.4K data subset or the test set for the 100K data subset; the data itself is stored in the data folder, as illustrated in the sketch below)
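As an illustration of how these pieces fit together (the split file name and line format below are assumptions, not the project's actual convention), a reference list from data-split can be resolved against the files that dvc pull places under data:
from pathlib import Path

repo_root = Path("/mnt/repo")                              # hypothetical checkout location
split_file = repo_root / "data-split" / "train_6.4K.txt"   # hypothetical split file

# Assuming each line of the split file is a path relative to the data folder,
# map the reference list to concrete files pulled by DVC.
train_files = [
    repo_root / "data" / line.strip()
    for line in split_file.read_text().splitlines()
    if line.strip()
]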
Fetching datasets and sharing them across pipeline stages
Before we trigger model training, we need to provision the datasets. This is done by a dedicated dataset preparation stage in the Kubeflow pipeline.
This stage clones the GitHub repository, pulls the dataset using dvc pull, and makes it available for the training stage.
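A minimal sketch of what this preparation step might run inside its container (the repository URL, commit, and working directory are placeholders; the real entrypoint receives its inputs as pipeline arguments):
import subprocess

def prepare_data(repo_url: str, git_sha: str, workdir: str = "/mnt/repo") -> None:
    # Clone the dataset repository and check out the exact commit to reproduce.
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["git", "checkout", git_sha], cwd=workdir, check=True)
    # Fetch the dataset version referenced by data.dvc from remote storage.
    subprocess.run(["dvc", "pull"], cwd=workdir, check=True)

If workdir lives on a volume shared with the training stage (more on that below), the pulled files become visible there as well.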
However, there is a catch: each stage/container in a pipeline may run on any of the nodes in the Kubeflow/Kubernetes cluster. This poses the challenge of sharing data across containers/nodes.
Amazon Elastic File System to the rescue
EFS enables us to build applications that need shared access to file data. EFS is (and always has been) simple and serverless: you simply create a file system, attach it to any number of EC2 instances, Lambda functions, or containers, and go about your work. EFS is a highly durable, scalable, and consistent file system.
It is for these reasons that I decided to leverage EFS for sharing data across containers.
As shown in the diagram above, we mount a volume in the first stage of the pipeline. This stage finds or creates a Kubernetes PersistentVolumeClaim (PVC), which is later mounted on the data preparation and training containers.
Snippet to create/find a PVC in a Kubeflow pipeline
vop = dsl.VolumeOp(
    name="data-sharing-pvc",
    resource_name="data-sharing-pvc",
    storage_class="",           # empty: bind to a pre-provisioned (EFS-backed) PersistentVolume
    size="10Gi",
    modes=dsl.VOLUME_MODE_RWM,  # ReadWriteMany, so multiple pods can share the volume
).set_display_name("Mount Volume")
Snippet to mount the above volume on any container/stage (the image name below is a placeholder)
prepare_data = dsl.ContainerOp(
    name="preparedata",
    image=prepare_data_image,  # placeholder: your dataset-preparation container image
    arguments=[
        "data",
        data_repository,
        data_repository_git_sha,
        data_directory,
        data_split_directory,
        data_split_name,
    ],
    pvolumes={"/mnt": vop.volume},  # mount the shared EFS-backed volume at /mnt
).set_display_name("Prepare Dataset")
This mounts the EFS volume at /mnt in the dataset preparation container.
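To actually share the dataset, the training stage mounts the same volume. A minimal sketch, with the image name and arguments as placeholders:
train = dsl.ContainerOp(
    name="train",
    image=training_image,  # placeholder: your training container image
    arguments=[data_directory, data_split_name],
    # Reusing the volume from the preparation step makes the pulled dataset
    # visible here and makes Kubeflow schedule this step after "preparedata".
    pvolumes={"/mnt": prepare_data.pvolume},
).set_display_name("Train Model")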
Points to note when using EFS
- Amazon EFS lifecycle management automatically manages cost-effective file storage for your file systems
- You can use Bursting Throughput or Provisioned Throughput based on your requirements, data size, average file size, etc.
- Overall throughput generally increases as the average I/O size increases
- More here
Conclusion
DVC is our tool of choice because our training data consists of large binary files; the right choice may differ for other use cases.
If you would like further implementation details, need help with infrastructure setup, or have feedback to make this story better, feel free to post questions/comments.