Each transformer layer has multiple attention heads that independently learn which tokens should attend to which other tokens. A head computes Query, Key, and Value vectors for every token, then uses softmax(QKᵀ/√d) to produce attention weights, where d is the dimension of the key vectors. Different heads specialize in different patterns.
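To make the formula concrete, here is a minimal single-head sketch in NumPy. The shapes, weight matrices, and random inputs are illustrative stand-ins, not values from any real model.

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One head: project tokens to Q/K/V, weight V by softmax(QK^T / sqrt(d))."""
    Q = X @ W_q                              # (seq_len, d) queries: "what am I looking for?"
    K = X @ W_k                              # (seq_len, d) keys:    "what do I contain?"
    V = X @ W_v                              # (seq_len, d) values:  "what do I pass along?"
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (seq_len, seq_len) raw affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights              # per-token output, plus the attention matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                 # 6 tokens, 16-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
out, A = attention_head(X, W_q, W_k, W_v)
print(A.sum(axis=-1))                        # ~[1. 1. 1. 1. 1. 1.]
```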
The positional head tracks adjacency, attending strongly to neighboring tokens. The syntax head learns grammatical dependencies, connecting verbs to their subjects across the sentence. The copy head fires when it sees repeated tokens. The retrieval head makes long-range connections, linking a pronoun to its antecedent dozens of tokens back (coreference).
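As a toy illustration of the copy pattern, the sketch below hand-builds the mask such a head tends to learn: fire on earlier occurrences of the current token. This is a constructed target pattern, not learned weights, and the example sentence is made up.

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]
n = len(tokens)
copy_pattern = np.zeros((n, n))              # target pattern, not softmaxed weights
for i in range(n):
    for j in range(i):                       # causal: only look backward
        if tokens[j] == tokens[i]:
            copy_pattern[i, j] = 1.0         # "the" at position 4 fires on position 0
print(copy_pattern[4])                       # [1. 0. 0. 0. 0. 0.]
```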
Hover over a token to see where it looks (outgoing attention in lime) and who looks at it (incoming attention in cyan). The heatmap shows the full N×N attention matrix: the cell at row i, column j is how much token i attends to token j. Each row sums to 1 after the softmax.
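That row/column reading translates directly into indexing: row i is token i's outgoing attention, column i its incoming attention. The sketch below builds a random softmaxed matrix purely for illustration; the token list and variable names are assumptions, not the demo's internals.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["she", "saw", "the", "dog", "and", "it", "barked"]
n = len(tokens)
scores = rng.normal(size=(n, n))
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # rows sum to 1

i = tokens.index("it")
outgoing = A[i]                              # where "it" looks (lime in the demo)
incoming = A[:, i]                           # who looks at "it" (cyan in the demo)
print(tokens[int(outgoing.argmax())])        # token "it" attends to most
assert np.allclose(A.sum(axis=-1), 1.0)      # every row is a probability distribution
```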
Real models have dozens of layers with 32-128 heads each. The patterns here are simplified but representative. In practice, many heads are redundant or uninterpretable, which is why techniques like attention head pruning can remove 30-50% of heads with minimal quality loss.
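Pruning can be sketched in a few lines: score each head by some importance measure, then drop the weakest. The norm-based score and 40% threshold below are illustrative simplifications (practical methods usually estimate importance from gradients or ablation loss), and every name here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, seq_len, d_head = 8, 6, 16
head_outputs = rng.normal(size=(n_heads, seq_len, d_head))  # stand-in for real head outputs
layer_out = head_outputs.sum(axis=0)         # concat + projection simplified to a sum

# Importance proxy: how large each head's contribution to the layer output is.
importance = np.linalg.norm(head_outputs.reshape(n_heads, -1), axis=1)
keep = importance >= np.quantile(importance, 0.4)           # prune the weakest 40%
pruned_out = head_outputs[keep].sum(axis=0)

print(f"kept {keep.sum()}/{n_heads} heads")
print(f"output change: {np.linalg.norm(layer_out - pruned_out):.2f}")
```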
Vaswani et al., "Attention Is All You Need" (2017)