During autoregressive inference, the KV cache stores the key and value tensors computed for previous tokens. Without it, you would recompute the key and value projections for the entire context at every decoding step, turning an O(1)-per-token operation into an O(n) one.
KV cache memory grows linearly with both sequence length and batch size. Per token, each layer stores one key and one value vector per KV head, so the total size is 2 × layers × kv_heads × head_dim × seq_len × batch × bytes per element. For large models serving long contexts, this can run to multiple gigabytes on top of the model weights themselves.
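The sizing formula above can be captured in a few lines. The example numbers assume a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, no grouped-query attention) stored in fp16:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes. The leading factor of 2 covers
    keys plus values; bytes_per_elem=2 assumes fp16/bf16 storage."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Llama-2-7B-like config, 4096-token context, batch of 1, fp16:
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # exactly 2.0 GiB for this config
```

Note how quickly this scales: at batch size 16 the same configuration needs 32 GiB of KV cache alone, which is why serving systems treat it as a first-class memory budget.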
Production systems must manage this memory carefully. vLLM's PagedAttention allocates KV cache in fixed-size blocks rather than contiguous per-sequence buffers sized for the maximum context, which avoids internal fragmentation and lets many sequences share the same pool. Monitor KV cache memory alongside model memory, not as an afterthought.
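To make the block-allocation idea concrete, here is a toy sketch of the scheme PagedAttention is built on. This is not vLLM's actual API; the class and method names are illustrative. Each sequence holds a table of block IDs, and a new block is claimed from a shared free list only when the current one fills up:

```python
class BlockAllocator:
    """Toy block-based KV cache allocator: sequences grow block by
    block from a shared pool instead of reserving contiguous
    max-length buffers up front (the idea behind PagedAttention)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free block IDs
        self.tables = {}                      # seq_id -> [block IDs]
        self.lengths = {}                     # seq_id -> token count

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        # Claim a fresh block only on the first token of each block.
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id: int) -> None:
        # Return all of the sequence's blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The key property: a sequence of n tokens wastes at most one partially filled block, and finished sequences return their blocks to the pool immediately, so throughput-oriented servers can pack far more concurrent requests into the same GPU memory.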