Autoregressive generation is slow: one token per forward pass. Speculative decoding uses a small "draft" model to guess multiple tokens, then the main model verifies them in parallel.
The draft model proposes a short run of tokens cheaply; the main model then checks all of them in a single forward pass. Guesses that match the main model's output are accepted; at the first mismatch, the main model's own token is emitted instead and decoding continues from there.
The speedup scales with how often the draft model guesses correctly, so the technique works best when outputs are predictable, such as code completion or structured data.
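The draft-then-verify loop can be sketched in a few lines. This is a toy greedy version, not a real implementation: the "models" below are hypothetical deterministic functions standing in for LLM forward passes, and the verify phase calls the main model token by token to simulate what would normally be one batched forward pass.

```python
def draft_model(context):
    # Hypothetical cheap model: guess by repeating the last token.
    return context[-1] if context else 0

def main_model(context):
    # Hypothetical main model: the "true" next token is a simple
    # deterministic function of the context.
    return sum(context) % 5

def speculative_step(context, k=4):
    """Draft k tokens, verify them, return the tokens accepted this step.

    In real speculative decoding the main model scores all k drafted
    prefixes in one forward pass; here we call it per token to keep
    the sketch self-contained.
    """
    # 1. Draft phase: the small model guesses k tokens autoregressively.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify phase: accept drafted tokens while they match the main
    #    model; on the first mismatch, emit the main model's token
    #    instead and stop. Every step therefore yields at least one
    #    main-model-quality token.
    accepted = []
    ctx = list(context)
    for t in drafted:
        expected = main_model(ctx)
        if t != expected:
            accepted.append(expected)  # main model's correction
            return accepted
        accepted.append(t)
        ctx.append(t)

    # All k guesses matched: the verification pass also gives us one
    # extra token for free.
    accepted.append(main_model(ctx))
    return accepted

if __name__ == "__main__":
    print(speculative_step([1, 2]))  # first draft guess is wrong here
    print(speculative_step([0]))     # all draft guesses match here
```

When the draft model is wrong immediately, the step still emits one correct token, so the method never produces worse output than plain autoregressive decoding; when the draft guesses well, a step can emit up to k + 1 tokens.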