AdamW, the standard optimizer for transformer training, stores two additional values per parameter:
- First moment (mean of gradients)
- Second moment (mean of squared gradients)
This triples the memory footprint: for a model with N parameters in FP32, the weights take 4N bytes and the two optimizer states another 8N bytes, for 12N bytes in total.
This is why full fine-tuning requires so much memory.
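The arithmetic above can be sketched in a few lines. This is a minimal back-of-the-envelope calculator, not a profiler; the 7-billion-parameter figure in the example is purely illustrative, and it assumes plain FP32 AdamW with no sharding or mixed precision.

```python
def adamw_memory_gb(num_params: float, bytes_per_value: int = 4):
    """Estimate FP32 memory for weights plus AdamW optimizer states.

    Assumes bytes_per_value bytes each for the weights, the first
    moment (m), and the second moment (v) -- 3x per parameter total.
    """
    params_gb = num_params * bytes_per_value / 1e9
    states_gb = 2 * num_params * bytes_per_value / 1e9  # m and v
    return params_gb, states_gb, params_gb + states_gb

# Illustrative example: a hypothetical 7B-parameter model in FP32
p, s, total = adamw_memory_gb(7e9)
print(f"params: {p:.0f} GB, optimizer states: {s:.0f} GB, total: {total:.0f} GB")
# → params: 28 GB, optimizer states: 56 GB, total: 84 GB
```

Gradients (another 4N bytes in FP32) and activations come on top of this, which is why full fine-tuning at this scale typically needs multiple accelerators or memory-saving techniques.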