A dense transformer sends every token through every parameter. Mixture of Experts (MoE) replaces the monolithic feed-forward layer with N smaller expert sub-networks and a learned router that assigns each token to the top-K experts. A 235B-parameter model might activate only 22B parameters per token.
The router computes a softmax over expert scores for each token, then selects the top-K. Only those K experts run their forward pass. The outputs are combined with the router's gating weights. This is why MoE models can be much larger in total parameters while matching the inference cost of a smaller dense model.
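The routing step above can be sketched in a few lines of numpy. This is a minimal single-token illustration, not any particular model's implementation; the expert networks here are hypothetical linear maps, and shapes are toy-sized.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through the top-k of N experts (illustrative sketch)."""
    logits = x @ router_w                     # (N,) one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over expert scores
    topk = np.argsort(probs)[-k:]             # indices of the k highest-scoring experts
    gates = probs[topk] / probs[topk].sum()   # renormalized gating weights
    # Only the selected experts run a forward pass; the rest are skipped.
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

# Toy setup: 4 experts, each a small linear map (hypothetical stand-ins
# for full feed-forward sub-networks).
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)
y = moe_forward(x, router_w, experts, k=2)    # only 2 of 4 experts execute
```

With k=2 of 4 experts, each token pays roughly half the feed-forward compute of the dense equivalent, which is the source of the total-vs-active parameter gap.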
Watch the expert utilization heatmap at the bottom. Without load balancing, a few popular experts would handle most tokens while others sit idle, wasting capacity. An auxiliary loss term penalizes uneven routing, pushing the distribution toward uniform. Switch input types to see how different experts specialize: code tokens route differently than prose or math.
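One common form of that auxiliary loss, from the Switch Transformer line of work (the models in the demo may use a different variant), multiplies each expert's dispatch fraction by its mean router probability. A minimal sketch:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, n_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: (tokens, n_experts) softmax outputs from the router.
    expert_assignment: (tokens,) index of each token's chosen expert.
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # P_i: mean router probability mass on expert i across tokens
    P = router_probs.mean(axis=0)
    # Minimized (value 1.0) exactly when routing is uniform across experts.
    return n_experts * np.sum(f * P)

# Perfectly balanced routing: 8 tokens, 4 experts, uniform probabilities.
probs = np.full((8, 4), 0.25)
assign = np.array([0, 1, 2, 3] * 2)           # each expert receives 2 tokens
print(load_balancing_loss(probs, assign, 4))  # → 1.0, the minimum
```

When one expert hogs the tokens, both its f and P grow, the product rises above 1.0, and the gradient pushes the router back toward the uniform distribution the heatmap visualizes.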
This architecture powers DeepSeek-V3, Qwen3-235B, and Kimi K2. The key insight is that not all tokens need the same computation. Simple tokens (articles, punctuation) can use lightweight experts while complex tokens (technical terms, ambiguous syntax) route to heavier specialists. This is sparse activation in practice.