DVC
Version Control System for Machine Learning Projects
Francesco Casalegno
What is DVC?
● Simple command line Git-like experience.
○ Does not require installing and maintaining any databases.
○ Does not depend on any proprietary online services.
● Management and versioning of datasets and ML models.
○ Data is saved in S3, Google Cloud, Azure, SSH server, HDFS, or even local HDD RAID.
● Makes projects reproducible and shareable; answers questions on how a model was built.
● Helps manage experiments with Git tags/branches and metrics tracking.
“DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs)
which are being used frequently as both knowledge repositories and team ledgers.
DVC also replaces both ad-hoc scripts to track, move, and deploy different model versions; as
well as ad-hoc data file suffixes and prefixes.”
● dvc and git
○ git: version code, small files
○ dvc: version data, intermed. results, models
○ dvc uses git, w/o storing file content in repo
● versioning and storing large files
○ dvc saves info on data in special .dvc files
○ .dvc files can then be versioned using git
○ actual storage happens via remote storage
○ dvc supports many remote storage types
● dvc main features
○ data versioning
○ data access
○ data pipelines
Getting Started
Install
● Install as a Python package.
$ pip install dvc
● Depending on the remote storage you will use, you may want to install specific dependencies.
$ pip install 'dvc[s3]' # support Amazon S3
$ pip install 'dvc[ssh]' # support SSH
$ pip install 'dvc[all]' # support all remote types
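● To sanity-check the installation (exact output varies by version):
$ dvc version # prints DVC, Python, and platform info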
Initialization
● We must work inside a Git repository. If it does not exist yet, we create and initialize one.
$ mkdir ml_project && cd ml_project
$ git init
$ dvc init
$ git status -s
A .dvc/.gitignore
A .dvc/config
A .dvc/plots/confusion.json
A .dvc/plots/default.json
A .dvc/plots/scatter.json
A .dvc/plots/smooth.json
A .dvcignore
$ git commit -m "Initialize dvc project"
● Initializing a DVC project creates and automatically git adds a few important files.
Tell Git not to track .dvc/cache and .dvc/tmp
INI-style file with configurations for
- dvc remote storage — name, url
- dvc cache – reflink/copy/hardlink, location, ...
- ...
Tell dvc what not to track (empty for now)
Plot templates (visualize & compare metrics)
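● For illustration, .dvcignore uses the same pattern syntax as .gitignore; a minimal sketch with hypothetical patterns:
*.tmp
scratch/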
Data Versioning
Getting some data
● Let's download some data to train and validate a "cat vs dog" CNN classifier.
We use dvc get, which is like wget to download data/models from a remote dvc repo.
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip
$ unzip data.zip && rm -f data.zip
inflating: data/train/cats/cat.001.jpg
...
data
├── train
│   ├── dogs # 500 pictures
│   └── cats # 500 pictures
└── validation
    ├── dogs # 400 pictures
    └── cats # 400 pictures
● This folder contains 43 MB of JPG images organized in a hierarchical fashion.
Start versioning data
● Tracking data with DVC is very similar to tracking code with git.
$ dvc add data/
100% Add|██████████|1/1 [00:30, 30.51s/file]
To track the changes with git, run:
git add .gitignore data.dvc
$ git add .gitignore data.dvc
$ git commit -m "Add first version of data/"
$ git tag -a "v1.0" -m "data v1.0, 1000 images"
Tell git not to track the data/ directory
DVC-generated, contains hash to track data/ :
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
path: data
● Quite a few things happened when calling dvc add:
○ The hash of the content of data/ was computed and added to a new data.dvc file
○ .gitignore was updated to tell Git not to track the content of data/
○ The physical content of data/ —i.e. the jpg images— has been moved to a cache
(by default the cache is located in .dvc/cache/ but using a remote cache is possible!)
○ The files were linked back to the workspace so that it looks like nothing happened
(the user can configure the link type to use: hard link, soft link, reflink, copy)
→ human readable, can be versioned with git!
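● As a sketch, the preferred link types can be set with the cache.type option, as a list tried in order:
$ dvc config cache.type "reflink,hardlink,symlink,copy"
$ dvc checkout --relink # re-apply links in the workspace after changing the setting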
Make changes to tracked data (add)
● Let's download some more data for our "cat vs dog" dataset.
Running dvc diff will confirm that dvc is aware that the data has changed!
● To track the changes in our data with dvc, we follow the same procedure as before.
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ unzip new-labels.zip && rm -f new-labels.zip
inflating: data/train/cats/cat.501.jpg
...
$ dvc diff
Modified:
data/
$ dvc add data/
$ git diff data.dvc
outs:
-- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
+- md5: 21060888834f7220846d1c6f6c04e649.dir
path: data
$ git commit -am "New version of data/ with more training images"
$ git tag -a "v2.0" -m "data v2.0, 2000 images"
Switch between versions
● To switch version, first run git checkout.
This affects data.dvc but not the workspace files in data/ !
● To fix this mismatch we simply call dvc checkout.
This reads the cache and updates the data in the workspace based on the current *.dvc files.
$ git checkout v1.0
$ dvc diff
Modified:
data/
$ dvc checkout
M data/
$ dvc status
Data and pipelines are up to date.
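● To restore a single dataset version without touching the rest of the workspace, a sketch (takes only the data pointer from the v1.0 tag):
$ git checkout v1.0 -- data.dvc
$ dvc checkout data.dvc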
Working with Storages
Configure remote storage
● A remote storage is for dvc what GitHub is for git:
○ push and pull files from your workspace to the remote
○ easy sharing between developers
○ safe backup should you ever make a terrible mistake à la rm -rf *
● Many remote storages are supported (Google Drive, Amazon S3, Google Cloud, SSH, HDFS, HTTP, …)
But (as with Git) nothing prevents us from using a "local remote"!
$ mkdir -p ~/tmp/dvc_storage
$ dvc remote add --default loc_remote ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.
$ git add .dvc/config
$ git commit -m "Configure remote storage loc_remote"
DVC-generated, .dvc/config now contains the remote storage config:
[core]
remote = loc_remote
['remote "loc_remote"']
url = /root/tmp/dvc_storage
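● The flow is identical for a cloud remote; a sketch with a hypothetical S3 bucket (requires dvc[s3]):
$ dvc remote add --default s3_remote s3://my-bucket/dvc-storage
$ dvc remote modify s3_remote region eu-west-1 # optional remote-specific settings
$ git add .dvc/config && git commit -m "Configure S3 remote"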
Storing, sharing, retrieving from storage
● Running dvc push uploads the content of the cache to the remote storage.
This is pretty much like git push.
● Now, even if all the data is deleted from our workspace and cache, we can download it with dvc pull.
This is pretty much like git pull.
$ dvc push
1800 files pushed
$ rm -rf .dvc/cache data
$ dvc pull # update .dvc/cache with contents from remote
1800 files fetched
$ dvc checkout # update workspace, linking data from .dvc/cache
A data/
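● Note: dvc pull is essentially dvc fetch followed by dvc checkout, and the two steps can also be run separately:
$ dvc fetch # download data from the remote into .dvc/cache only
$ dvc checkout # link the cached data into the workspace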
Access data from storage
● First, we can explore the content of a DVC repo hosted on a Git server.
$ dvc list https://github.com/iterative/dataset-registry
README.md
get-started/
tutorial/
...
● When working outside of a DVC project —e.g. in automated ML model deployment— use dvc get.
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
● When working inside of another DVC project, we want to keep the connection between the projects.
In this way, others can know where the data comes from and whether new versions are available.
dvc import is like dvc get + dvc add, but the resulting .dvc file also includes a ref to the source repo!
$ dvc import https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ git add new-labels.zip.dvc .gitignore
$ git commit -m "Import data from source"
● Note. For all these commands we can specify a git revision (sha, branch, or tag) with --rev <commit>.
Data Registries
● We can build a DVC project dedicated only to tracking and versioning datasets and models.
The repository would have all the metadata and history of changes in the different datasets.
● This is a data registry, a middleware between ML projects and cloud storage.
This introduces quite a few advantages.
○ Reusability — reproduce and organize feature stores with a simple dvc get / import
○ Optimization — track data shared by multiple projects centralized in a single location
○ Data as code — leverage Git workflows such as commits, branching, pull requests, CI/CD …
○ Persistence — a DVC registry-controlled remote storage improves data security
● Versioning large data files for data science is great, but it is not all DVC can do:
DVC data pipelines capture how data is filtered, transformed, and used to train models!
Data Pipelines
Motivation
● With dvc add we can track large files—this includes files such as trained models, embeddings, etc.
However, we also want to track how such files were generated, for reproducibility and better tracking!
● The following is an example of a typical ML pipeline. Its structure is a DAG (directed acyclic graph).
[Figure: example pipeline DAG; each box is a stage with a name, parameters (◾), a script, inputs, and outputs]
data.xml → Prepare (train-test split; ◾ seed, ◾ split; prepare.py) → train.tsv, test.tsv
train.tsv, test.tsv → Featurize (TF-IDF embed; ◾ max_feats, ◾ n_grams; featurize.py) → train.pkl, test.pkl
train.pkl → Train (RandomForest; ◾ seed, ◾ n_estimators; train.py) → model.pkl
model.pkl, test.pkl → Evaluate (PR curve, AUC; evaluate.py) → scores.json, prc.json
Tracking ML Pipelines
● Option A: run pipeline stages, then track the output artifacts with dvc add.
$ python src/prepare.py data/data.xml
$ dvc add data/prepared/train.tsv data/prepared/test.tsv
● Option B: run pipeline stages and track them together with all their dependencies with dvc run.
→ Advantages of Option B
1. outputs are automatically tracked (i.e. saved in .dvc/cache)
2. pipeline stages with parameter names are saved in dvc.yaml
3. deps, params, outs are all hashed and tracked in dvc.lock
4. like a Makefile, can reproduce with dvc repro prepare—re-run only if deps changed!
$ dvc run -n prepare \
    -p prepare.seed \
    -p prepare.split \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
-n: stage name
-p: parameters, read from params.yaml
-d: dependencies (including the script!)
-o: outputs to track
params.yaml:
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
...
Example
● Let's create a DVC repo for an NLP project.
$ mkdir nlp_project && cd nlp_project
$ git init && dvc init && git commit -m "Init dvc repo"
● Then we download some data + some code to prepare the data and train/evaluate a model.
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
    -o data/data.xml
$ dvc add data/data.xml
$ git add data/.gitignore data/data.xml.dvc && git commit -m "Add data, first version"
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip
$ tree
.
├── data
│   ├── data.xml
│   └── data.xml.dvc
├── params.yaml
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
train:
  seed: 20170428
  n_estimators: 50
YAML file with params for all the pipeline stages
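● For reference, each stage script reads its own section of params.yaml; a minimal sketch assuming PyYAML (not the actual prepare.py):
import sys
import yaml

# read only the "prepare" section of params.yaml
params = yaml.safe_load(open("params.yaml"))["prepare"]
seed, split = params["seed"], params["split"]
input_path = sys.argv[1] # e.g. data/data.xml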
Example
● Let's run the prepare stage.
$ dvc run -n prepare \
    -p prepare.seed \
    -p prepare.split \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
$ git add data/.gitignore dvc.yaml dvc.lock
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
dvc.yaml (first snippet): describes data pipelines, similar to how Makefiles work for building software.
dvc.lock (second snippet): matches the dvc.yaml file. Created and updated by DVC commands like dvc run.
It describes the latest pipeline state, to:
1. track intermediate and final artifacts (like a .dvc file)
2. allow DVC to detect when stage definitions or dependencies changed, triggering a re-run.
● Note: dependencies and artifacts are automatically tracked, no need to dvc add them!
Example
● Then we run the featurize and train stages in the same way.
$ dvc run -n featurize \
    -p featurize.max_features \
    -p featurize.ngrams \
    -d src/featurization.py \
    -d data/prepared \
    -o data/features \
    python src/featurization.py data/prepared data/features
$ git add data/.gitignore dvc.yaml dvc.lock
$ dvc run -n train \
    -p train.seed \
    -p train.n_estimators \
    -d src/train.py \
    -d data/features \
    -o model.pkl \
    python src/train.py data/features model.pkl
$ git add data/.gitignore dvc.yaml dvc.lock
Example
● And finally we run the evaluation stage.
$ dvc run -n evaluate \
    -d src/evaluate.py \
    -d model.pkl \
    -d data/features \
    --metrics-no-cache scores.json \
    --plots-no-cache prc.json \
    python src/evaluate.py model.pkl data/features scores.json prc.json
$ git add dvc.yaml dvc.lock
--metrics-no-cache scores.json declares an output metrics file.
A special kind of output file (-o); must be JSON, and can be used to compare experiments in tabular form.
E.g. here it contains the AUC score.
The -no-cache suffix prevents DVC from storing the file in the cache.
--plots-no-cache prc.json declares an output plot file.
A special kind of output file (-o); must be JSON, and can be used to compare experiments in plot form.
E.g. here it contains data for the precision-recall curve plot.
The -no-cache suffix prevents DVC from storing the file in the cache.
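● For illustration, scores.json might contain a single JSON object (the AUC value matches the metrics diff shown later), while prc.json would hold a list of precision/recall points:
{ "auc": 0.61314 }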
Plot dependency graphs
$ dvc dag
+-------------------+
| data/data.xml.dvc |
+-------------------+
*
*
*
+---------+
| prepare |
+---------+
*
*
*
+-----------+
| featurize |
+-----------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
Reproducing Pipelines
● dvc repro regenerates data pipeline results by restoring the DAG defined by the stages listed in dvc.yaml.
It compares file hashes with dvc.lock to re-run stages only when needed. This is like make in software builds.
● Case 1: nothing changed, re-running pipeline stages is skipped.
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.
● Case 2: a dependency changed, pipeline stages are re-run if needed.
$ sed -i -e "s@max_features: 500@max_features: 1500@g" params.yaml
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize' with command:
python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train' with command:
python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
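● After the re-run, commit the updated lock file and optionally tag the experiment; a sketch with a hypothetical tag name:
$ git add dvc.lock params.yaml
$ git commit -m "Re-run pipeline with max_features=1500"
$ git tag -a "exp-1500-feats" -m "featurize.max_features = 1500"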
Comparing experiments
● dvc params diff rev_1 rev_2 shows how parameters differ between two git revisions/tags.
Without arguments, it shows how they differ between the workspace and the last commit.
$ dvc params diff
Path Param Old New
params.yaml featurize.max_features 500 1500
● dvc metrics diff rev_1 rev_2 does the same for metrics.
$ dvc metrics diff
Path Metric Value Change
scores.json auc 0.61314 0.07139
● dvc plots diff rev_1 rev_2 does the same for plots.
$ dvc plots diff -x recall -y precision
file:///Users/dvc/example-get-started/plots.html
Shared Development Server
● Disk space optimization
Avoid having 1 cache per user!
● Use DVC as usual
- Each dvc add or dvc run moves
data to the shared external cache!
- Each dvc checkout links required
data to the workspace!
● See the DVC docs for implementation details,
but basically it's not too difficult:
$ mkdir -p path_shared_cache/
$ mv .dvc/cache/* path_shared_cache/
$ dvc cache dir path_shared_cache/
$ dvc config cache.shared group
$ git commit -m "config shared cache"
Conclusions
● DVC is a version control system for large ML data and artifacts.
● DVC integrates with Git through *.dvc and dvc.lock files, to version files and pipelines, respectively.
● DVC repos can work as data registries, i.e. a middleware between cloud storage and ML projects.
● To track raw ML data files, use dvc add—e.g. for input datasets.
● To track intermediate or final results of a ML pipeline, use dvc run—e.g. for model weights or processed datasets.
● Consider using a shared development server with a unified, shared external cache.