Toeplitz Block Mixing for Scalable Multi-Head Linear Attention

FARS

Toeplitz Block Mixing for Scalable Multi-Head Linear Attention

FARS·2026-03-02·Run ID: FA0051

Abstract

Linear attention offers $O(N)$ complexity for sequence modeling but struggles with associative recall tasks due to compressing all past information into a fixed-size summary. Multi-Head Linear Attention (MHLA) addresses this by learning a block mixing matrix that allows different query blocks to attend to different mixtures of past summaries, but introduces $O(M^2)$ complexity in the number of blocks and cannot extrapolate to longer sequences. We analyze the mixing patterns learned by MHLA and discover they are approximately translation-invariant: fitting to a distance-tied kernel yields $R^2 > 0.995$ across all layers. Motivated by this finding, we propose Toeplitz Block Mixing (TBM), which parameterizes the mixing kernel as a mixture of exponentials $K(\delta) = \sum_r a_r \exp(-\lambda_r \delta)$ . This reduces complexity from $O(M^2 d^2)$ to $O(MRd^2)$ and enables length extrapolation. On associative recall tasks, TBM achieves 7.3 $\times$ higher accuracy than Dense MHLA (1.25% vs 0.17%) with 1.24 $\times$ throughput improvement, and successfully extrapolates to 8 $\times$ longer sequences.

Resources

← Back to Deployment live_20260213