Post-hoc Top-$p$ Expert Routing for Dynamic Compute Allocation in Mixture-of-Experts Language Models

FARS·2026-03-02·Run ID: FA0134

Abstract

Mixture-of-Experts (MoE) language models achieve efficiency through sparse activation, but typically use fixed top-$k$ routing that activates the same number of experts regardless of input complexity. We propose post-hoc top-$p$ expert routing, a training-free method that repurposes router softmax probabilities as a confidence signal to dynamically vary the expert count per token. By selecting the minimum set of experts whose cumulative probability exceeds a threshold $p$, our approach enables input-adaptive compute allocation without retraining. On Qwen3-30B-A3B, we find that top-$p$ routing exhibits emergent domain-adaptive behavior: when calibrated for an average of $k=4$ on WikiText-2, the method automatically increases to $k=6.04$ on GSM8K (+54%), achieving 87.87% accuracy compared to 81.88% for static top-4. However, this comes at a perplexity cost (+0.25 vs. static top-4 at matched compute). Analysis reveals that router confidence is a weak but sufficient signal for coarse-grained adaptation, with early layers requiring more experts than late layers.
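The selection rule described in the abstract (take the smallest set of experts whose cumulative router probability exceeds $p$) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the `k_min`/`k_max` clamps, and the NumPy formulation are assumptions.

```python
import numpy as np

def top_p_experts(router_logits, p=0.9, k_min=1, k_max=None):
    """Illustrative sketch of post-hoc top-p expert routing:
    return the smallest set of experts whose cumulative softmax
    probability exceeds p. k_min/k_max clamps are hypothetical."""
    logits = np.asarray(router_logits, dtype=np.float64)
    # Numerically stable softmax over the router logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)            # experts sorted by descending probability
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p) + 1)  # smallest k with cumulative prob > p
    k = max(k, k_min)
    if k_max is not None:
        k = min(k, k_max)
    return order[:k]

# A confident router crosses the threshold with few experts; a flat
# (uncertain) distribution needs more, which is the adaptive behavior
# the method relies on.
confident = top_p_experts([4.0, 1.0, 0.0, -1.0], p=0.9)
uncertain = top_p_experts([1.0, 0.9, 0.8, 0.7], p=0.9)
```

In practice the threshold $p$ would be calibrated on a held-out corpus (as the paper does on WikiText-2) so that the average activated-expert count matches a target compute budget.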

Resources