Greedy decoding picks the highest-probability token at each step. It is fast but shortsighted. A locally optimal choice can lead to a globally poor sequence. Beam search fixes this by maintaining multiple candidate sequences (beams) in parallel, keeping the top B at each step.
At each generation step, every active beam is extended with all possible next tokens. The candidates are scored by cumulative probability (the product of all token probabilities along the path). Only the top B scoring sequences survive. The rest are pruned.
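The expand-score-prune loop above can be sketched in a few lines of Python. The toy next-token table, token names, and probabilities here are invented for illustration; scores are kept as summed log-probabilities, which is equivalent to multiplying the raw probabilities but numerically safer.

```python
import math

# Hypothetical toy next-token model: maps a context tuple to {token: probability}.
MODEL = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("a",): {"cat": 0.9, "dog": 0.1},
    ("the", "cat"): {"<eos>": 1.0},
    ("the", "dog"): {"<eos>": 1.0},
    ("a", "cat"): {"<eos>": 1.0},
    ("a", "dog"): {"<eos>": 1.0},
}

def beam_search(beam_width, max_steps=3):
    # Each beam is (sequence, cumulative log-probability).
    beams = [((), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams pass through
                continue
            for tok, p in MODEL[seq].items():    # extend with every next token
                candidates.append((seq + (tok,), score + math.log(p)))
        # Keep only the top-B candidates; the rest are pruned.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

On this toy table, `beam_search(1)` behaves like greedy search and ends up with "the dog" (total probability 0.3), while `beam_search(2)` keeps the initially tied "a" alive and finds "a cat" (0.45), a globally better sequence that greedy discards at step one.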
Set beam width to 1 and you get greedy search: a single path, no alternatives considered. Increase it to 3 or 5 and watch the tree explore more branches. Wider beams can find higher-scoring sequences, at the cost of more computation (linear in beam width per step).
The sequence score shown at the top is the sum of log-probabilities along the best path. Real systems often apply length normalization, since raw cumulative scores bias toward shorter sequences (every added token can only lower the score). The visualization shows the tradeoff: wider beams explore more candidates, and pruning surfaces globally better paths that greedy would have missed.
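One common form of length normalization divides the summed log-probability by the sequence length raised to a power alpha. This is a minimal sketch; the alpha value and the example probabilities are made up for illustration, not taken from any particular system.

```python
import math

def length_normalized(log_probs, alpha=0.7):
    # Divide the cumulative log-probability by length**alpha.
    # alpha=0.7 is an illustrative setting, not a universal default.
    return sum(log_probs) / (len(log_probs) ** alpha)

short = [math.log(0.5)] * 2   # 2 tokens at p=0.5 each; sum ~ -1.386
long  = [math.log(0.7)] * 4   # 4 tokens at p=0.7 each; sum ~ -1.427

# By raw cumulative score the shorter sequence wins...
assert sum(short) > sum(long)
# ...but after normalization the longer, per-token-likelier one does.
assert length_normalized(long) > length_normalized(short)
```

The flip in the example is the point: without normalization, beam search would prefer the two-token sequence even though the four-token one is more probable per token.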
Wikipedia: Beam search