Training precision is overkill for inference. Quantize to reduce model size and speed up inference.
Common formats:
- FP16/BF16: 2x smaller than FP32, minimal quality loss
- INT8: 4x smaller, slight quality loss
- INT4: 8x smaller, noticeable quality loss on some tasks
Match precision to your quality and speed requirements.
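The size savings come from storing each weight in fewer bits and keeping a scale factor to map back to floats. A minimal sketch of symmetric per-tensor INT8 quantization (the function names and the toy weight tensor are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8: map floats into [-127, 127] via one scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

# Toy "layer" of FP32 weights.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"FP32 bytes: {w.nbytes}, INT8 bytes: {q.nbytes}")  # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")    # bounded by scale / 2
```

Real runtimes add refinements (per-channel scales, zero points for asymmetric ranges, calibration data), but the storage math is the same: 8 bits per weight plus a handful of scale values instead of 32 bits per weight.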