During autoregressive generation, each new token attends to every previous token. Without caching, this means recomputing every key and value vector from scratch at every step, turning generation into an O(n²) operation. The KV cache stores these vectors so each token's key and value are computed once and then reused from memory on every subsequent step.
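A minimal sketch of one decode step with a KV cache, using NumPy and made-up weight matrices: only the new token's projections are computed, while all earlier keys and values come straight from the cache.

```python
import numpy as np

def generate_step(x_t, W_q, W_k, W_v, cache):
    """One decode step: project q, k, v for the new token only,
    append k and v to the cache, then attend over all cached keys."""
    q = x_t @ W_q                       # query for the current token
    cache["K"].append(x_t @ W_k)        # key computed once, then cached
    cache["V"].append(x_t @ W_v)        # value computed once, then cached
    K = np.stack(cache["K"])            # (t, d) -- grows by one row per step
    V = np.stack(cache["V"])
    scores = K @ q / np.sqrt(len(q))    # dot product against every cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()                        # softmax attention weights
    return w @ V                        # weighted sum of cached values

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):                      # five decode steps
    out = generate_step(rng.normal(size=d), W_q, W_k, W_v, cache)
```

After five steps the cache holds exactly five keys and five values, one pair per generated token, regardless of how many times attention was computed.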
Toggle between cached and uncached modes to see the difference. In cached mode, only the new row (the current token) lights up. In uncached mode, watch the entire matrix recompute top-to-bottom every step. The memory bar on the right grows linearly either way, but the computation cost diverges dramatically.
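The diverging computation cost can be made concrete by counting key/value projections over a full generation (a back-of-the-envelope model, not a profiler measurement): cached mode does one projection per token, while uncached mode redoes the whole prefix at every step.

```python
def projections_computed(n_tokens, cached):
    """Total key/value projections performed while generating n_tokens."""
    if cached:
        return n_tokens                 # each token projected exactly once
    # uncached: step t recomputes projections for all t tokens so far
    return sum(t for t in range(1, n_tokens + 1))

print(projections_computed(100, cached=True))   # linear in n
print(projections_computed(100, cached=False))  # quadratic: n(n+1)/2
```

At 100 tokens that is 100 projections versus 5,050; the gap widens quadratically with context length, which is exactly the O(n²) cost the visualization animates.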
Prompt caching (used by Anthropic and others) extends this idea across requests. If two API calls share the same system prompt prefix, the KV pairs for that prefix are stored and reused. This can reduce first-token latency by 85% and cost by 90% on cache hits. The cache has a TTL (typically 5 minutes). Keeping the cache warm means structuring prompts so the prefix stays identical across calls.
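A toy sketch of the cross-request idea: a cache keyed by the exact prompt prefix, with entries that expire after a TTL. This is a hypothetical illustration (the `PrefixCache` class and its methods are invented here); real systems cache KV tensors keyed by token prefixes, not strings, but the hit/miss and expiry behavior is the same in spirit.

```python
import time

class PrefixCache:
    """Toy prompt-prefix cache with a TTL (hypothetical sketch)."""
    def __init__(self, ttl_seconds=300):        # 5-minute TTL, as in the text
        self.ttl = ttl_seconds
        self.entries = {}                       # prefix -> (kv, expiry time)

    def lookup(self, prefix):
        entry = self.entries.get(prefix)
        if entry and entry[1] > time.monotonic():
            # a hit refreshes the expiry, keeping the cache warm
            self.entries[prefix] = (entry[0], time.monotonic() + self.ttl)
            return entry[0]
        return None                             # miss or expired

    def store(self, prefix, kv):
        self.entries[prefix] = (kv, time.monotonic() + self.ttl)

cache = PrefixCache()
system_prompt = "You are a helpful assistant."
cache.store(system_prompt, kv="precomputed-kv-for-prefix")
hit = cache.lookup(system_prompt)               # identical prefix: hit
miss = cache.lookup(system_prompt + " Be brief.")  # any change: miss
```

Note that the lookup is an exact prefix match: appending even one character to the system prompt produces a different key, which is why the text stresses keeping the prefix byte-identical across calls.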
The attention arrows show the core operation: the new token's query vector dot-products against every cached key to produce attention weights. With KV caching, the keys are already there. Without it, you're rebuilding the entire phone book just to look up one number.