LLM inference is expensive. These techniques reduce cost and latency.
KV Caching: Store the attention key/value vectors computed for earlier tokens. Reuse them for every subsequent token. Essential optimization for autoregressive generation.
Speculative Decoding: A small draft model proposes tokens. The large model verifies them in parallel. Typically a 2-3× speedup.
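A minimal greedy-verification sketch of that loop. `draft_next` and `target_next` are hypothetical stand-ins for the small and large models' greedy next-token functions, and verification is written sequentially for clarity; real systems score all draft positions in one batched forward pass of the large model.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    max_new_tokens: int = 32,
    k: int = 4,                      # tokens the draft model proposes per round
) -> List[int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. Draft: the small model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: the large model checks each draft position.
        accepted = k
        for i in range(k):
            target_tok = target_next(tokens + draft[:i])
            if target_tok != draft[i]:
                accepted = i
                tokens.extend(draft[:i])
                tokens.append(target_tok)   # keep the large model's own token
                break
        if accepted == k:
            tokens.extend(draft)             # every drafted token was accepted
    return tokens[:limit]

# Toy usage: identical toy "models", so every draft token is accepted.
print(speculative_decode([1, 2, 3], lambda c: len(c) % 7, lambda c: len(c) % 7, max_new_tokens=8))
```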
Quantization: Reduce numerical precision (e.g., FP16 → INT8 → INT4). Smaller models, faster inference. Some quality loss.
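A minimal NumPy sketch of symmetric per-tensor INT8 quantization, just to show where the rounding error (the "quality loss") comes from; it is not any particular library's scheme.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: FP32 weights -> INT8 values + one FP32 scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize_int8(q, scale)).max())  # the quality loss
```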
Batching: Process multiple requests together. Better GPU utilization.
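A minimal sketch of static batching: pad variable-length requests to the same length plus an attention mask, so one forward pass can serve them all. The function and field names here are illustrative.

```python
import numpy as np

def batch_requests(token_lists, pad_id=0):
    """Pad variable-length requests into one rectangular batch plus an attention mask."""
    max_len = max(len(t) for t in token_lists)
    input_ids = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_lists), max_len), dtype=np.int64)
    for i, toks in enumerate(token_lists):
        input_ids[i, : len(toks)] = toks
        attention_mask[i, : len(toks)] = 1   # mask hides the padding positions
    return input_ids, attention_mask

# Three requests of different lengths -> one (3, 4) batch for a single forward pass.
ids, mask = batch_requests([[5, 9, 2], [7, 1], [3, 3, 3, 3]])
```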
Interview question: "How does KV caching work?"
During generation, only the newest token needs fresh K and V projections; the K and V for all previous tokens are read from the cache and reused, so each decode step avoids recomputing them for the entire context.
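A minimal single-head NumPy sketch of that pattern; the weight matrices and list-based cache are illustrative, not a specific framework's API.

```python
import numpy as np

d = 8                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []               # grow by one entry per generated token

def attend_step(x_new):
    """x_new: embedding of the single newest token, shape (d,)."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)          # fresh K/V computed only for the new token
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)         # attend over all cached keys
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V            # softmax-weighted sum of cached values

for _ in range(5):                      # decode loop: one token per step
    out = attend_step(rng.standard_normal(d))
```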