During autoregressive generation, you compute attention for each new token against all previous tokens. Without caching, you'd recompute keys and values for the entire sequence at each step.
The KV cache stores computed key and value vectors. When generating token t, you only compute K and V for token t and retrieve the rest from the cache.
This reduces per-token K and V computation from O(n) to O(1), where n is the sequence length. The trade-off is memory: KV cache size grows linearly with sequence length. Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) reduce this memory cost by sharing K/V projections across query heads.
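The mechanism can be sketched in a few lines of NumPy. This is a minimal single-head illustration with made-up weights and dimensions, not a production implementation: each generation step projects only the newest token through the K and V matrices, appends the result to the cache, and attends over everything cached so far.

```python
import numpy as np

d = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)
# Hypothetical projection weights for query, key, and value.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def step(x):
    """Attention for one new token x (shape (d,)) using cached K/V."""
    q = x @ Wq
    # Only the new token's K and V are computed; the rest come from cache.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)        # (t, d): all keys so far
    V = np.stack(v_cache)        # (t, d): all values so far
    scores = K @ q / np.sqrt(d)  # attend over every cached position
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                 # context vector for the new token

for _ in range(5):               # autoregressive loop: one token per step
    out = step(rng.standard_normal(d))
```

After the loop, `len(k_cache)` equals the number of tokens generated, which is the linear memory growth described above: without the cache, each `step` would instead recompute K and V for the whole prefix.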