Mixed precision keeps some values in FP32 (for stability) while using BF16/FP16 for most operations.
Typical savings:
- Parameters: 2 bytes instead of 4 (50% reduction)
- Activations: 2 bytes instead of 4 (50% reduction)
- Gradients: 2 bytes instead of 4 (50% reduction)
Optimizer states often stay in FP32 for stability, so the overall reduction is smaller than the 50% seen per tensor. Still, total memory drops noticeably.
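
To make the arithmetic concrete, here is a minimal sketch of the per-parameter accounting. It rests on assumptions that are not in the text above: a hypothetical 1B-parameter model, an Adam-style optimizer that keeps two FP32 states per parameter (8 bytes), no separate FP32 master copy of the weights, and activations left out because they depend on batch size and architecture. The helper name `training_memory_bytes` is made up for illustration.

```python
def training_memory_bytes(n_params, param_bytes, grad_bytes, optimizer_bytes_per_param=8):
    """Bytes for parameters, gradients, and optimizer states (activations excluded).

    optimizer_bytes_per_param=8 assumes two FP32 states per parameter,
    as in Adam; this is an illustrative assumption, not a fixed rule.
    """
    return n_params * (param_bytes + grad_bytes + optimizer_bytes_per_param)


n_params = 1_000_000_000  # hypothetical 1B-parameter model

# Full FP32: 4-byte parameters and 4-byte gradients.
fp32 = training_memory_bytes(n_params, param_bytes=4, grad_bytes=4)

# Mixed precision: 2-byte parameters and gradients, optimizer states still FP32.
mixed = training_memory_bytes(n_params, param_bytes=2, grad_bytes=2)

reduction = 1 - mixed / fp32

print(f"FP32:      {fp32 / 1e9:.0f} GB")   # 16 GB
print(f"Mixed:     {mixed / 1e9:.0f} GB")  # 12 GB
print(f"Reduction: {reduction:.0%}")       # 25%
```

Under these assumptions the FP32 optimizer states dominate the budget, which is why the total saving is well below 50%. Real numbers will differ: setups that keep an FP32 master copy of the weights save less on this side, while activation memory, which is halved, can shift the total in the other direction for large batches.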