Autoregressive generation is sequential: each token waits for the previous one. Speculative decoding breaks this bottleneck. A small, fast draft model proposes K tokens in rapid succession, then the large target model verifies all K in a single parallel forward pass. Accepted tokens are mathematically guaranteed to match what the target model would have generated on its own.
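The draft-then-verify loop can be sketched in a few lines. This is a greedy simplification of the idea (real verification is probabilistic, as described next); `target_next` and `draft_next` are hypothetical stand-ins mapping a token sequence to each model's next token, and the target's K+1 sequential calls here stand in for what is actually a single batched forward pass.

```python
def speculative_greedy(target_next, draft_next, seq, K):
    """One round of greedy speculative decoding (illustrative sketch).

    target_next / draft_next: toy callables, token list -> next token,
    standing in for model forward passes.
    """
    # 1. Draft phase: the small model proposes K tokens autoregressively.
    draft = []
    for _ in range(K):
        draft.append(draft_next(seq + draft))
    # 2. Verify phase: the target evaluates all positions (one parallel
    #    pass in practice); keep the longest prefix where it agrees.
    accepted = []
    for i in range(K):
        t = target_next(seq + accepted)
        accepted.append(t)        # the target's token is always kept
        if t != draft[i]:         # first mismatch: discard the rest
            return seq + accepted
    # All K drafts matched: the same pass yields one bonus token.
    accepted.append(target_next(seq + accepted))
    return seq + accepted
```

With a strong draft, each round advances up to K+1 tokens for the cost of one target pass plus K cheap draft steps.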
The verification uses modified rejection sampling. For each draft token x, compare the draft model's probability q(x) to the target model's p(x). If p(x) ≥ q(x), accept; otherwise accept with probability p(x)/q(x). On rejection, resample from the adjusted distribution norm(max(0, p − q)) and discard all later draft tokens. This ensures the final output distribution is identical to pure target-model sampling.
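A minimal sketch of the acceptance test, assuming the per-position vocabulary distributions `p_draft` and `p_target` have already been computed (the target's K+1 distributions come from its one parallel pass; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_tokens, p_draft, p_target):
    """Modified rejection sampling over K drafted tokens.

    p_draft has K vocab distributions, p_target has K+1 (the extra one
    enables a bonus token). Returns the accepted prefix plus one
    corrected or bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i][tok], p_draft[i][tok]
        # Accept with probability min(1, p/q): always when the target
        # assigns equal or higher probability than the draft.
        if rng.random() < min(1.0, p / q):
            out.append(tok)
        else:
            # Rejected: resample from the residual distribution
            # max(0, p_target - p_draft), renormalized, and stop.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # All K accepted: sample a bonus token from the target's
    # distribution at position K (from the same forward pass).
    out.append(int(rng.choice(len(p_target[-1]), p=p_target[-1])))
    return out
```

Per round this yields between 1 and K+1 tokens, every one distributed exactly as if the target had sampled it alone.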
The speedup depends on how well the draft model approximates the target. A high acceptance rate means most speculated tokens survive, yielding up to K+1 tokens per target forward pass. The "draft quality" slider controls this. Watch how the acceptance rate drops when the draft is weaker: each rejection discards every token speculated after it.
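Leviathan et al. quantify this: under an i.i.d. per-token acceptance rate α, the expected number of tokens produced per target pass is (1 − α^(K+1)) / (1 − α). A small helper makes the trade-off concrete:

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """Expected tokens generated per target forward pass, assuming an
    i.i.d. per-token acceptance rate alpha and K drafted tokens
    (geometric-series formula from Leviathan et al. 2023)."""
    if alpha == 1.0:
        return float(K + 1)          # every draft accepted, plus bonus
    return (1 - alpha ** (K + 1)) / (1 - alpha)
```

For example, α = 0.5 with K = 2 gives 1.75 tokens per pass, while α = 0.9 with K = 4 gives about 4.1, showing why draft quality matters more than raw K.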
This is the dominant inference optimization in 2025. Google's Gemini, Meta's Llama, and Anthropic's Claude all use variants. EAGLE-3 extends it with tree-structured speculation, exploring multiple continuation paths simultaneously rather than a single linear chain.
Leviathan et al., "Fast Inference from Transformers via Speculative Decoding", ICML 2023.