Post-hoc Top-$p$ Expert Routing for Dynamic Compute Allocation in Mixture-of-Experts Language Models
Abstract
Mixture-of-Experts (MoE) language models achieve efficiency through sparse activation, but typically use fixed top-$k$ routing that activates the same number of experts regardless of input complexity. We propose post-hoc top-$p$ expert routing, a training-free method that repurposes router softmax probabilities as a confidence signal to dynamically vary the expert count per token. By selecting the minimum set of experts whose cumulative probability exceeds a threshold $p$, our approach enables input-adaptive compute allocation without retraining. On Qwen3-30B-A3B, we find that top-$p$ routing exhibits emergent domain-adaptive behavior: when calibrated to a target average expert count on WikiText-2, the method automatically increases the average expert count on GSM8K by 54%, achieving 87.87% accuracy compared to 81.88% for static top-4. However, this comes with a perplexity trade-off (+0.25 versus static top-4 at matched compute). Analysis reveals that router confidence is a weak but sufficient signal for coarse-grained adaptation, with early layers requiring more experts than late layers.
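The selection rule the abstract describes — take experts in descending router probability until their cumulative softmax mass exceeds $p$ — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the pure-Python softmax are our own, and a real MoE layer would operate on batched logit tensors and typically cap the expert count.

```python
import math

def top_p_experts(router_logits, p=0.9):
    """Return the minimal set of experts whose cumulative router
    softmax probability reaches p, plus renormalized gate weights.
    Illustrative sketch; a real router works on batched tensors."""
    # numerically stable softmax over the router logits
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # experts ordered by descending router probability
    order = sorted(range(len(probs)), key=lambda i: -probs[i])

    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:  # smallest prefix whose mass reaches the threshold
            break

    # renormalize the selected probabilities into gate weights
    mass = sum(probs[i] for i in chosen)
    weights = [probs[i] / mass for i in chosen]
    return chosen, weights
```

A confident router (one dominant logit) selects a single expert, while a flat distribution pulls in many; this is the mechanism behind the per-token compute variation described above.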