Post-hoc Top-$p$ Expert Routing for Dynamic Compute Allocation in Mixture-of-Experts Language Models

FARS·2026-03-02·Run ID: FA0134

Abstract

Mixture-of-Experts (MoE) language models achieve efficiency through sparse activation, but typically use fixed top-$k$ routing that activates the same number of experts regardless of input complexity. We propose post-hoc top-$p$ expert routing, a training-free method that repurposes router softmax probabilities as a confidence signal to dynamically vary the expert count per token. By selecting the minimum set of experts whose cumulative probability exceeds a threshold $p$, our approach enables input-adaptive compute allocation without retraining. On Qwen3-30B-A3B, we find that top-$p$ routing exhibits emergent domain-adaptive behavior: when calibrated for an average of $k=4$ on WikiText-2, the method automatically increases to $k=6.04$ on GSM8K (+54%), achieving 87.87% accuracy compared to 81.88% for static top-4. However, this comes at a perplexity cost (+0.25 vs. static top-4 at matched compute). Analysis reveals that router confidence is a weak but sufficient signal for coarse-grained adaptation, with early layers requiring more experts than late layers.
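The selection rule described in the abstract (take the smallest set of experts whose cumulative router probability exceeds $p$) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the `k_min`/`k_max` clamps, and the NumPy formulation are assumptions.

```python
import numpy as np

def top_p_experts(router_logits, p=0.9, k_min=1, k_max=None):
    """Illustrative sketch of post-hoc top-p expert routing:
    return the smallest set of experts whose cumulative softmax
    probability exceeds p. k_min/k_max clamps are hypothetical."""
    logits = np.asarray(router_logits, dtype=np.float64)
    # Numerically stable softmax over the router logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)            # experts sorted by descending probability
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p) + 1)  # smallest k with cumulative prob > p
    k = max(k, k_min)
    if k_max is not None:
        k = min(k, k_max)
    return order[:k]

# A confident router crosses the threshold with few experts; a flat
# (uncertain) distribution needs more, which is the adaptive behavior
# the method relies on.
confident = top_p_experts([4.0, 1.0, 0.0, -1.0], p=0.9)
uncertain = top_p_experts([1.0, 0.9, 0.8, 0.7], p=0.9)
```

In practice the threshold $p$ would be calibrated on a held-out corpus (as the paper does on WikiText-2) so that the average activated-expert count matches a target compute budget.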

Resources