Batch size affects training dynamics significantly:
- Larger batches: More stable gradients, fewer updates per epoch
- Smaller batches: Noisier gradients, more updates, can escape local minima
Typical effective batch sizes for fine-tuning: -. If memory limits you to small batches, use gradient accumulation to simulate a larger effective batch: accumulate gradients over several micro-batches, then apply a single optimizer step.
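A minimal sketch of gradient accumulation, using a toy NumPy linear model (the data, micro-batch size of 8, and learning rate here are illustrative assumptions, not values from the text). The key property it demonstrates: averaging per-micro-batch gradients before a single update reproduces the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # 32 samples, 4 features (toy data)
y = rng.normal(size=32)
w = np.zeros(4)

def grad(Xb, yb, w):
    """Gradient of the MSE loss 0.5*mean((Xb@w - yb)**2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient: what we would compute with effective batch size 32.
g_full = grad(X, y, w)

# Gradient accumulation: 4 micro-batches of 8, averaged before one step.
accum = np.zeros_like(w)
for i in range(0, 32, 8):
    accum += grad(X[i:i+8], y[i:i+8], w)
accum /= 4  # average over micro-batches

# Same gradient as the full batch, but each micro-batch fit in "memory".
assert np.allclose(g_full, accum)

# One optimizer step with the accumulated gradient.
lr = 0.1
w -= lr * accum
```

In a deep-learning framework the pattern is the same: call backward on each micro-batch without zeroing gradients, then step the optimizer once every N micro-batches (scaling the loss by 1/N to average rather than sum).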