Format selection guide:
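- BF16: Default choice on GPUs with native BF16 support (NVIDIA Ampere and newer). It keeps FP32's exponent range, so most fine-tuning runs work without loss scaling.
- FP16: Use only on older GPUs that have FP16 tensor cores but no BF16 support (for example, V100 or T4). Enable loss scaling so small gradients don't underflow to zero.
- FP32: Use when debugging numerical issues, or when memory isn't a constraint.
- FP8: Experimental. Wait for better tool support unless you're on an H100 with a software stack specifically optimized for FP8.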
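
The guide above can be sketched as a small decision helper. This is a minimal illustration, not a definitive rule: `choose_precision` and `needs_loss_scaling` are hypothetical names, and the compute-capability thresholds (8.0 for BF16, 7.0 for FP16 tensor cores, 9.0 for H100-class FP8) are assumptions mapped from the bullets, so adjust them for your hardware.

```python
from typing import Tuple


def choose_precision(
    compute_capability: Tuple[int, int],
    debugging: bool = False,
    allow_fp8: bool = False,
) -> str:
    """Suggest a numeric format following the selection guide.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for A100.
    The thresholds below are assumptions for NVIDIA GPUs.
    """
    # FP32 when debugging numerical issues.
    if debugging:
        return "fp32"
    # FP8 is experimental: strictly opt-in, and only on H100-class GPUs.
    if allow_fp8 and compute_capability >= (9, 0):
        return "fp8"
    # BF16 is the default on GPUs with native support (Ampere and newer).
    if compute_capability >= (8, 0):
        return "bf16"
    # Older GPUs with FP16 tensor cores but no BF16 (Volta, Turing).
    if compute_capability >= (7, 0):
        return "fp16"
    # Assumption: with no tensor cores there is little to gain, so stay in FP32.
    return "fp32"


def needs_loss_scaling(fmt: str) -> bool:
    """Per the guide, only FP16 calls for loss scaling."""
    return fmt == "fp16"


if __name__ == "__main__":
    for cc in [(6, 1), (7, 0), (8, 0), (9, 0)]:
        fmt = choose_precision(cc)
        print(cc, fmt, "loss scaling:", needs_loss_scaling(fmt))
```

In a PyTorch setup, the tuple could come from `torch.cuda.get_device_capability()`; the helper itself stays framework-free so the decision logic is easy to audit.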