Users expect fast responses, so it pays to attack inference latency on several fronts (a short sketch of each technique follows the list):
- Quantization: Fewer bits per weight mean less memory traffic and faster compute
- Batching: Serve multiple requests with one forward pass, amortizing its cost
- Caching: Store responses to common prompts and reuse them instead of regenerating
- Speculative decoding: A small draft model proposes several tokens, and the large model verifies them in a single pass
- KV cache: Store the attention keys and values of previous tokens so each step only computes them for the newest token
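A minimal sketch of quantization, here per-tensor symmetric int8 with NumPy. Production systems use per-channel scales and fused int8 kernels, but the round trip below shows the core idea; `quantize_int8` and `dequantize` are illustrative names, not a real library API.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Per-tensor symmetric quantization: one float scale maps the
    # largest-magnitude weight to +/-127
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
print("fp32 bytes:", w.nbytes, "-> int8 bytes:", q.nbytes)  # 4x smaller
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```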
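Batching at its simplest is padding variable-length requests into one rectangular array so a single forward pass covers all of them. The sketch below, with a hypothetical `pad_batch` helper, shows the padding and mask bookkeeping; production servers typically go further with continuous batching, admitting and retiring requests mid-generation.

```python
import numpy as np

def pad_batch(token_seqs, pad_id=0):
    # Pad variable-length token sequences into one rectangular batch,
    # with a mask marking which positions hold real tokens
    max_len = max(len(s) for s in token_seqs)
    batch = np.full((len(token_seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(token_seqs), max_len), dtype=np.int64)
    for i, seq in enumerate(token_seqs):
        batch[i, : len(seq)] = seq
        mask[i, : len(seq)] = 1
    return batch, mask

requests = [[5, 9, 2], [7, 1], [3, 3, 8, 4]]  # three concurrent prompts
batch, mask = pad_batch(requests)
print(batch)  # one (3, 4) array -> one forward pass serves all three
```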
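A sketch of an exact-match response cache. It assumes a deterministic `generate` callable (a stand-in here, think greedy decoding at temperature 0); under random sampling, reusing cached replies would change the product's behavior.

```python
import hashlib

cache: dict[str, str] = {}

def cached_generate(prompt: str, generate) -> str:
    # Exact-match cache: identical prompts skip generation entirely
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = generate(prompt)
    return cache[key]

answer = cached_generate("What is 2+2?", lambda p: "4")  # computed
answer = cached_generate("What is 2+2?", lambda p: "4")  # served from cache
```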
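A toy greedy variant of speculative decoding: the draft model proposes `k` tokens and the target model keeps the longest prefix matching its own choices. The real algorithm verifies all draft positions in one batched forward pass and uses a probabilistic accept/reject rule; `target_next` and `draft_next` here are stand-in next-token functions called one position at a time for clarity.

```python
def speculative_step(target_next, draft_next, context, k=4):
    # target_next / draft_next: stand-ins mapping a token list to a
    # model's greedy next token
    draft, ctx = [], list(context)
    for _ in range(k):                         # cheap model proposes k tokens
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in draft:                          # expensive model verifies them
        if target_next(ctx) == tok:            # agreement: keep the free token
            accepted.append(tok)
            ctx.append(tok)
        else:                                  # first mismatch: take the
            accepted.append(target_next(ctx))  # target's token and stop
            break
    else:
        accepted.append(target_next(ctx))      # all matched: one bonus token
    return accepted

same = lambda ctx: (sum(ctx) + 1) % 50         # a draft that always agrees
print(speculative_step(same, same, [1, 2, 3]))  # 5 tokens from one step
```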
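A NumPy sketch of single-head attention with a growing KV cache: each decode step projects only the newest token and appends its key and value rather than recomputing K and V for the whole prefix. The weights and names are illustrative, not any particular model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []          # one cached row per previous token

def decode_step(x):
    # x: embedding of the newest token only; earlier K/V rows are reused
    q = x @ Wq
    K_cache.append(x @ Wk)         # compute K/V just for this token
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)    # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(5):                 # each step adds O(1) new projection work
    out = decode_step(rng.standard_normal(d))
print(out.shape)                   # (16,)
```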
Measure time to first token and tokens per second separately: the first reflects prefill cost and how responsive the app feels, the second reflects decode throughput over a long reply, and an optimization can improve one while degrading the other.
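A sketch of measuring both numbers from a streaming generator; `fake_stream` is a stand-in that mimics a slow prefill followed by steady per-token decoding.

```python
import time

def measure(stream):
    # stream: iterator yielding tokens as the model emits them
    start = time.perf_counter()
    first, count = None, 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()  # time to first token ends here
    end = time.perf_counter()
    ttft = first - start
    tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tps

def fake_stream(n=20, delay=0.01):
    # Stand-in generator: a slow "prefill" then steady decoding
    time.sleep(0.2)
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {tps:.1f} tok/s")
```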