Imbalanced datasets cause problems. If % of your examples are one category, the model learns to always predict that category.
Balance strategies:
- Undersample: Remove examples from overrepresented classes
- Oversample: Duplicate underrepresented examples
- Weighted sampling: Sample rare classes more frequently during training
Aim for roughly equal representation across categories you care about.