Before a language model sees any text, a tokenizer breaks it into tokens. The dominant method is byte-pair encoding (BPE), which starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. Each merge reduces the sequence length while building up a vocabulary of common subwords.
The algorithm is greedy: at each step, scan the entire sequence, count every adjacent pair, and merge the most frequent one. "t" + "h" becomes "th". "th" + "e" becomes "the". Common words become single tokens. Rare words stay as character fragments the model can still process.
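The greedy loop above fits in a few lines. This is a minimal sketch (the function name `train_bpe` and the stopping rule are choices made here, not part of any particular library):

```python
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)  # start from individual characters
    merges = []       # ordered merge rules: this list IS the tokenizer
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging gains nothing
        merges.append((a, b))
        # Replace every occurrence of the winning pair with one merged token.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, tokens = train_bpe("the cat and the hat", 5)
# First two merges: ('t', 'h'), then ('th', 'e') -- "the" builds up in steps.
```

Real implementations work on a word-frequency table rather than the raw sequence, but the merge order they discover is the same idea.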
This is why subword tokenization works so well. It avoids the fixed-vocabulary problem of whole-word tokenizers (which can't handle unseen words) while being far more compact than character-level encoding. GPT-4 uses ~100k BPE tokens. Claude uses a similar-sized vocabulary. The compression ratio is typically 3-4 characters per token for English.
Watch the merge log on the right. Each rule shows which pair was merged, in the order the algorithm discovered it. This ordered list of merge rules is the tokenizer. At inference time, the same rules are applied in the same order to any new text. The vocabulary and merge table are fixed after training.
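Inference is just a replay: walk the merge table in training order and apply each rule wherever it matches. A minimal sketch (the function name `apply_merges` is hypothetical, and the two hand-written rules stand in for a trained merge table):

```python
def apply_merges(text, merges):
    # Tokenize new text by replaying the merge rules in training order.
    seq = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

merges = [("t", "h"), ("th", "e")]
print(apply_merges("thee", merges))  # → ['the', 'e']
print(apply_merges("that", merges))  # → ['th', 'a', 't']
```

Note that "that" never saw the full merge to "the", so it stays as fragments, exactly the graceful degradation described above for rare words.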
The space character is often represented as "▁" (U+2581, lower one-eighth block) to make word boundaries explicit. This is the SentencePiece convention used by many modern tokenizers; byte-level BPE tokenizers in the GPT-2 lineage instead mark a leading space with "Ġ".
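In simplified form, the preprocessing just glues the marker onto the start of each word before any merges run (this sketch uses a plain string replace; the real SentencePiece library does more, e.g. normalization):

```python
MARK = "\u2581"  # ▁ lower one-eighth block

def mark_spaces(text):
    # Prefix the whole text and replace each space with the marker,
    # so word boundaries survive tokenization as explicit symbols.
    return MARK + text.replace(" ", MARK)

print(mark_spaces("the cat"))  # → ▁the▁cat
```

After this step, "▁the" and "the" are distinct tokens, which is how the tokenizer distinguishes a word from the same letters inside a longer word.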
Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units", 2016