Track these metrics during training:
- Training loss (should decrease)
- Validation loss (watch for overfitting)
- Learning rate (verify schedule is working)
- Gradient norm (spikes indicate instability)
- GPU memory usage (verify you're not exceeding limits)
Use Weights & Biases, TensorBoard, or MLflow. Good logging makes debugging possible.