Activations are the intermediate values saved during the forward pass for reuse in the backward pass. Their memory footprint depends on:
- Model architecture (number of layers, hidden size)
- Sequence length
- Batch size
Activations often dominate training memory. For a multi-billion-parameter model trained with a large batch size and long sequence length, activations alone can consume many gigabytes.
This is why batch size affects memory so much.
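The scaling above can be made concrete with a back-of-envelope estimator. This is a rough sketch, not a precise accounting: the function name, the `values_per_position` multiplier, and the example model shape are all illustrative assumptions (real activation memory depends on the exact architecture, attention implementation, and whether activation checkpointing is used).

```python
def activation_memory_gib(batch_size, seq_len, hidden_size, num_layers,
                          bytes_per_value=2, values_per_position=16):
    """Rough activation-memory estimate for a transformer, in GiB.

    Assumes each layer saves roughly `values_per_position` activation
    values per token per hidden unit (an illustrative constant, not an
    exact figure), stored at `bytes_per_value` bytes (2 for fp16/bf16).
    """
    num_values = batch_size * seq_len * hidden_size * num_layers * values_per_position
    return num_values * bytes_per_value / 1024**3


# Hypothetical model shape: 32 layers, hidden size 4096, seq len 2048.
full = activation_memory_gib(batch_size=8, seq_len=2048,
                             hidden_size=4096, num_layers=32)
half = activation_memory_gib(batch_size=4, seq_len=2048,
                             hidden_size=4096, num_layers=32)
print(f"batch 8: {full:.1f} GiB, batch 4: {half:.1f} GiB")
```

Because every factor enters the product linearly, halving the batch size halves the estimate, which is why batch size is usually the first knob turned when a training run runs out of memory.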