← Lessons

quiz vs the machine

Gold1400

Machine Learning

Data Versioning With DVC

Treating datasets like code with content addressed, git linked versions.

5 min read · core · beat Gold to climb

Why version data

Code in git is versioned, but the dataset that code trains on often is not. Without data versioning you cannot reproduce a run or explain why metrics changed when only the data moved. DVC brings git like versioning to large data.

How DVC works

  • Large files are stored in remote storage such as object storage, not in git.
  • DVC writes a small pointer file containing a content hash into the git repo.
  • Checking out a commit and running pull fetches the exact data that commit references.

Content addressing

Each file is identified by the hash of its contents, so identical data is stored once and any change produces a new hash. The git history of pointer files becomes a precise log of how the dataset evolved.

What this enables

You can tag a release with its exact data, diff two dataset versions, and reproduce a model by checking out the commit and pulling the matching data.

Key idea

DVC versions datasets by storing content hashed files in remote storage and tiny pointer files in git, so any commit recreates the exact data it used.

Check yourself

Answer to earn rating on the learn ladder.

1. Where does DVC store the large data files themselves?

2. What does content addressing by hash provide?