Why version data
Code in git is versioned, but the dataset that code trains on often is not. Without data versioning you cannot reproduce a run or explain why metrics changed when only the data moved. DVC brings git like versioning to large data.
How DVC works
- Large files are stored in remote storage such as object storage, not in git.
- DVC writes a small pointer file containing a content hash into the git repo.
- Checking out a commit and running pull fetches the exact data that commit references.
Content addressing
Each file is identified by the hash of its contents, so identical data is stored once and any change produces a new hash. The git history of pointer files becomes a precise log of how the dataset evolved.
What this enables
You can tag a release with its exact data, diff two dataset versions, and reproduce a model by checking out the commit and pulling the matching data.
Key idea
DVC versions datasets by storing content hashed files in remote storage and tiny pointer files in git, so any commit recreates the exact data it used.