Track your datasets like code. Every change should be versioned.
Tools:
- DVC (Data Version Control): Git for data
- Hugging Face Datasets: Built-in versioning
- Simple approach: Include date/version in filenames
When something goes wrong with training, you need to know exactly which data you used. Reproducibility requires version control.