Training-Free Linear Routing for Sparse Attention via Attention-Mass Prediction
Abstract
Sparse attention enables efficient long-context inference by routing each query to a subset of key-value buckets, but learned routers require per-head training, while training-free alternatives such as symmetric k-means deliver markedly lower quality. We investigate whether training-free routing can approach learned-router quality. Surprisingly, we find that geometric moment matching---aligning query-key distributions via dot-product-preserving gauge transforms---provides no improvement over symmetric k-means. This reveals that routing quality depends on predicting \emph{which buckets contain high attention mass}, not on distribution alignment. Building on this insight, we propose GWR Linear, which predicts attention mass per bucket via a closed-form ordinary least squares (OLS) fit. On Qwen2.5-7B, GWR Linear achieves 72.6% attention-mass recall@32, closing 63.6% of the gap between symmetric k-means (51.3%) and a learned MLP router (84.8%) without any iterative training. Gap closure increases with routing budget (44%--80%) and generalizes across attention heads (mean 69.8% across 6 heads spanning layers 2--26).
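To make the core idea concrete, the following is a minimal, hypothetical NumPy sketch of OLS-based attention-mass routing. It is not the paper's implementation: the dimensions, the softmax-based definition of per-bucket mass, and the recall@k metric are illustrative assumptions; the sketch only shows how a closed-form least-squares fit can map a query vector to predicted per-bucket attention mass without iterative training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not values from the paper).
d, n_queries, n_buckets, keys_per_bucket = 32, 512, 64, 8
k = 8  # routing budget (buckets selected per query)

Q = rng.standard_normal((n_queries, d))
K = rng.standard_normal((n_buckets, keys_per_bucket, d))

# Ground-truth per-bucket attention mass: softmax over all keys,
# then summed within each bucket.
logits = Q @ K.reshape(-1, d).T / np.sqrt(d)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
mass = probs.reshape(n_queries, n_buckets, keys_per_bucket).sum(axis=2)

# Closed-form OLS: W = argmin ||Q W - mass||^2, solved directly via
# least squares -- no gradient descent or iterative training involved.
W, *_ = np.linalg.lstsq(Q, mass, rcond=None)
pred = Q @ W  # predicted per-bucket mass for each query

# Attention-mass recall@k: fraction of true attention mass captured
# by the k buckets with the highest predicted mass.
topk = np.argsort(-pred, axis=1)[:, :k]
captured = np.take_along_axis(mass, topk, axis=1).sum(axis=1)
recall = captured.mean()
print(f"recall@{k}: {recall:.3f}")
```

Routing then simply keeps the top-k predicted buckets per query; the fit itself is a single linear solve, which is what makes the approach training-free in the sense used above.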