In the ever-evolving landscape of machine learning (ML), data stands as the cornerstone upon which triumphant models are built. However, as ML projects expand and encompass larger and more complex datasets, the challenge of efficiently managing and controlling data at scale becomes more pronounced.

Breaking Down Conventional Approaches: The Copy-Paste Predicament

In the world of data science, it’s commonplace for data scientists to extract subsets of data to their local environments for model training. This method allows for iterative experimentation, but it introduces challenges that hinder the seamless evolution of ML projects:

Reproducibility Constraints: Traditional practices of copying and modifying data locally lack the version control and auditability crucial for reproducibility. Iterating on models with various data subsets becomes a daunting task.

Inefficient Data Transfer: Regularly shuttling data between the central repository and local environments strains resources and time, especially when choosing different subsets of data for each training run.

Limited Compute Power: Operating within a local environment hampers the ability to harness the full power of parallel computing, as well as the distributed prowess of systems like Apache Spark.

In this talk, we will demonstrate:

How to use lakeFS to version control your data when working with your data locally.
How to use lakeFS without the need to copy data locally, and train your model at scale directly on the cloud.

We will be leveraging a technology stack of:

AWS S3
Databricks
Delta Lake
PyTorch
MLflow
Session Summary
ML Data Version Control and Reproducibility at Scale
MLconf Online 2023
Iddo Avneri
Treeverse
VP, CS