Cap-and-Spill: Two-Pass CUDA-Graph MoE Dispatch Without Worst-Case Padding
Abstract
Mixture-of-Experts (MoE) models rely on AllToAll collective communication to dispatch tokens to distributed experts. CUDA graphs improve inference throughput by eliminating kernel launch overhead, but they require fixed buffer sizes at capture time. Current approaches allocate worst-case buffers, resulting in 88% padding waste due to the heavy-tailed nature of MoE routing distributions. We propose Cap-and-Spill, a two-pass dispatch strategy that uses a quantile-based buffer capacity for the first pass and handles overflow tokens in a second pass. Both passes use fixed buffers, maintaining CUDA-graph compatibility. On 8×A100 NVLink with Mixtral-8x7B routing traces, Cap-and-Spill reduces mean dispatch latency by 33.9% (1077 µs → 712 µs), recovering 53.2% of the gap to oracle eager dispatch. The approach maintains exact correctness with bitwise equality to the baseline, and we find that unconditionally executing both passes outperforms conditional execution by eliminating synchronization overhead.
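The core bookkeeping of the two-pass split can be illustrated with a minimal sketch (the function names `quantile_cap` and `split_counts` are illustrative, not the paper's implementation): per-expert token counts are clipped to a fixed cap chosen from a quantile of historical counts, and the overflow goes to the second pass, so both passes see fixed-size buffers.

```python
import math

def quantile_cap(count_samples, q=0.99):
    """Pick a fixed per-expert capacity from historical token counts
    at quantile q (hypothetical helper; the paper's capacity-selection
    details may differ)."""
    s = sorted(count_samples)
    idx = min(len(s) - 1, max(0, math.ceil(q * len(s)) - 1))
    return s[idx]

def split_counts(tokens_per_expert, cap):
    """Split per-expert token counts into capped first-pass counts
    and overflow ("spill") counts for the second pass."""
    first = [min(n, cap) for n in tokens_per_expert]
    spill = [n - f for n, f in zip(tokens_per_expert, first)]
    return first, spill

# Example: one expert overflows the cap; only its excess spills.
first, spill = split_counts([120, 30, 400, 55], cap=128)
print(first)  # [120, 30, 128, 55]
print(spill)  # [0, 0, 272, 0]
```

Because the spill counts are bounded by a second fixed buffer, both AllToAll passes can be captured in a CUDA graph; only the buffer contents change between replays.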