When things go wrong:
- Loss not decreasing: check the learning rate (too low stalls progress, too high oscillates), and verify that batches actually contain the expected data and labels
- NaN loss: lower the learning rate, clip gradients, and check the data for NaNs, infinities, or unnormalized values
- Out of memory: reduce the batch size, enable gradient checkpointing, or accumulate gradients over smaller micro-batches
- Slow training: confirm the GPU is actually being utilized, and look for data-loading bottlenecks (e.g. too few loader workers)
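The NaN-loss check above can be wired directly into the training loop: detect a bad loss, back off the learning rate, and retry. A minimal framework-agnostic sketch, where `train_step` is a hypothetical stand-in for your own step function:

```python
import math

def guarded_step(train_step, lr, min_lr=1e-7):
    """Run one training step; on NaN/inf loss, halve the learning
    rate and signal the caller to retry (e.g. from a checkpoint).

    train_step: callable taking a learning rate and returning a float
    loss (hypothetical stand-in for your framework's step function).
    """
    loss = train_step(lr)
    if math.isnan(loss) or math.isinf(loss):
        return None, max(lr / 2, min_lr)  # retry with a smaller lr
    return loss, lr

# Toy step function that diverges at high learning rates:
def toy_step(lr):
    return float("nan") if lr > 0.1 else 1.0 / (1.0 + lr)

loss, lr = None, 1.0
while loss is None:           # back off until the step succeeds
    loss, lr = guarded_step(toy_step, lr)
# lr has been halved from 1.0 down to 0.0625 before the loss is finite
```

In a real loop you would also reload the last good checkpoint before retrying, since a NaN loss usually means the weights are already corrupted.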
Log everything: loss, learning rate, gradient norms, throughput. Debugging without logs is guesswork.
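That logging can be as simple as one parseable key=value line per step; a sketch using only the standard library (the field names are illustrative, not a required schema):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("train")

def format_step(step, loss, lr, gpu_util=None):
    # One flat key=value line per step: cheap to write, trivial to
    # grep, and enough to reconstruct what happened after a crash.
    fields = {"step": step, "loss": f"{loss:.4f}", "lr": f"{lr:.2e}"}
    if gpu_util is not None:
        fields["gpu_util"] = f"{gpu_util:.0%}"
    return " ".join(f"{k}={v}" for k, v in fields.items())

log.info(format_step(100, 2.3456, 3e-4, gpu_util=0.92))
# logs a line ending in: step=100 loss=2.3456 lr=3.00e-04 gpu_util=92%
```

Keeping the formatting in a small helper like this also makes it easy to redirect the same fields to a metrics dashboard later.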