Data Versioning With DVC

Why version data

Code in git is versioned, but the dataset that code trains on often is not. Without data versioning, you cannot reliably reproduce a run or explain why metrics changed when only the data moved. DVC (Data Version Control) brings git-like versioning to large data.

How DVC works

Large files are stored in remote storage such as object storage, not in git.
DVC writes a small pointer file containing a content hash, and that pointer file is committed to git in place of the data.
Checking out a commit and running dvc pull fetches the exact data that commit references.
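
The three steps above can be sketched as a toy model. This is not DVC's implementation: toy_add, toy_pull, the .pointer.json suffix, and the use of a local directory as the "remote" are all hypothetical names chosen for illustration, and SHA-256 is used only as a convenient hash.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, remote: Path) -> Path:
    """Copy data_file into the remote under its hash and write a pointer file."""
    h = file_hash(data_file)
    remote.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, remote / h)
    # The pointer is tiny: only a file name and a hash (hypothetical format).
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "hash": h}))
    return pointer


def toy_pull(pointer: Path, remote: Path) -> Path:
    """Restore the data file named by a pointer from the remote."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(remote / meta["hash"], target)
    return target


def demo() -> bool:
    """Round trip: add a file, delete it locally, then pull it back."""
    original = b"id,label\n1,cat\n2,dog\n"
    with tempfile.TemporaryDirectory() as tmp:
        work, remote = Path(tmp) / "work", Path(tmp) / "remote"
        work.mkdir()
        data = work / "train.csv"
        data.write_bytes(original)
        pointer = toy_add(data, remote)
        data.unlink()  # simulate a fresh clone that holds only the pointer
        restored = toy_pull(pointer, remote)
        return restored.read_bytes() == original
```

In this model only the pointer file would be committed to git. The data itself lives in the remote directory under its hash, which is why the pull step can restore it byte for byte.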

Content addressing

Each file is identified by the hash of its contents, so identical data is stored once and any change produces a new hash. The git history of pointer files becomes a precise log of how the dataset evolved.
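
A minimal sketch of content addressing, assuming an in-memory store keyed by hash. ContentStore and content_id are hypothetical names for illustration, and SHA-256 stands in for whichever hash a real tool uses.

```python
import hashlib
from typing import Dict


def content_id(data: bytes) -> str:
    """Identify data purely by the hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy content-addressed store: the key is derived from the value."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = content_id(data)
        # Identical data maps to the same key, so it is stored only once.
        self.blobs.setdefault(key, data)
        return key


store = ContentStore()
v1 = store.put(b"id,label\n1,cat\n")
v1_copy = store.put(b"id,label\n1,cat\n")      # same bytes, same key
v2 = store.put(b"id,label\n1,cat\n2,dog\n")    # one added row, new key
```

Here v1 and v1_copy are the same key and occupy one slot in the store, while v2 gets a different key because its bytes differ.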

What this enables

You can tag a release with its exact data, diff two dataset versions, and reproduce a model by checking out the commit and pulling the matching data.
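
Diffing two dataset versions reduces to comparing hashes, because a changed file always has a changed hash. The sketch below assumes each version is summarized as a manifest mapping file path to content hash. diff_manifests and the manifest shape are hypothetical, and the hash strings are placeholders; DVC provides its own dvc diff command for real repositories.

```python
from typing import Dict, List


def diff_manifests(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two path-to-hash manifests without reading any data files."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical manifests for two commits; the hash values are placeholders.
v1 = {"train.csv": "a1", "test.csv": "b2", "labels.json": "c3"}
v2 = {"train.csv": "a9", "test.csv": "b2", "extra.csv": "d4"}

changes = diff_manifests(v1, v2)
```

With these inputs, changes reports extra.csv as added, labels.json as removed, and train.csv as modified, while test.csv is left out because its hash is unchanged.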

Key idea

DVC versions datasets by storing content-hashed files in remote storage and tiny pointer files in git, so checking out any commit and pulling its data recreates the exact dataset that commit used.
