AdamW is the standard optimizer for LLM fine-tuning. It combines:
- Adaptive learning rates per parameter
- Momentum for stable updates
- Decoupled weight decay (applied directly to the weights rather than mixed into the gradient, unlike classic Adam + L2 regularization)
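The three ingredients above can be shown in a minimal single-scalar sketch of one AdamW step. The function name and hyperparameter defaults are illustrative (they mirror common AdamW defaults), not a reference implementation:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    m and v are the running first/second moment estimates; t is the
    1-based step count. Weight decay is applied directly to theta
    (decoupled), not folded into the gradient.
    """
    m = beta1 * m + (1 - beta1) * grad           # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad    # per-parameter scale (second moment)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive update
    theta = theta - lr * weight_decay * theta              # decoupled decay
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
```

The last line of the update is the decoupling: the decay term never passes through the adaptive rescaling, so regularization strength does not depend on gradient magnitudes.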
Alternatives like Adafactor use less memory by storing a factored (row/column) approximation of the second moments instead of the full tensor. 8-bit Adam reduces memory further by quantizing both moment tensors, with minimal quality loss.
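To see why the memory savings matter: standard Adam/AdamW keeps two fp32 state tensors (first and second moments) per parameter, while 8-bit Adam stores both at one byte per element. A back-of-the-envelope calculation (the 7B parameter count is just an example size):

```python
def optimizer_state_gb(n_params, bytes_per_state, n_states=2):
    """Optimizer state size in GB: n_states tensors, each n_params elements."""
    return n_params * n_states * bytes_per_state / 1e9

n = 7_000_000_000  # example: a 7B-parameter model

fp32_gb = optimizer_state_gb(n, 4)  # standard AdamW: two fp32 moments
int8_gb = optimizer_state_gb(n, 1)  # 8-bit Adam: two quantized moments
```

For a 7B model that is 56 GB of optimizer state in fp32 versus 14 GB at 8 bits, on top of the weights and gradients themselves, which is often the difference between fitting on one GPU or not.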
Stick with AdamW unless memory constraints force you to an alternative.