Toeplitz Block Mixing for Scalable Multi-Head Linear Attention

FARS·2026-03-02·Run ID: FA0051

Abstract

Linear attention offers $O(N)$ complexity for sequence modeling but struggles with associative recall tasks because it compresses all past information into a fixed-size summary. Multi-Head Linear Attention (MHLA) addresses this by learning a block mixing matrix that allows different query blocks to attend to different mixtures of past summaries, but it introduces $O(M^2)$ complexity in the number of blocks and cannot extrapolate to longer sequences. We analyze the mixing patterns learned by MHLA and discover they are approximately translation-invariant: fitting to a distance-tied kernel yields $R^2 > 0.995$ across all layers. Motivated by this finding, we propose Toeplitz Block Mixing (TBM), which parameterizes the mixing kernel as a mixture of exponentials $K(\delta) = \sum_r a_r \exp(-\lambda_r \delta)$. This reduces complexity from $O(M^2 d^2)$ to $O(M R d^2)$ and enables length extrapolation. On associative recall tasks, TBM achieves 7.3$\times$ higher accuracy than Dense MHLA (1.25% vs. 0.17%) with a 1.24$\times$ throughput improvement, and successfully extrapolates to 8$\times$ longer sequences.
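As a minimal sketch of the core idea, the snippet below builds a causal block mixing matrix whose entries depend only on the block distance $\delta = i - j$, using the mixture-of-exponentials kernel $K(\delta) = \sum_r a_r \exp(-\lambda_r \delta)$ from the abstract. The function name `toeplitz_mixing` and the causal (lower-triangular) convention are illustrative assumptions, not the paper's implementation; only the kernel values need $O(MR)$ work, versus $O(M^2)$ free parameters for a dense mixing matrix.

```python
import numpy as np

def toeplitz_mixing(M, a, lam):
    """Illustrative Toeplitz block mixing matrix (assumed causal form).

    M   : number of blocks
    a   : (R,) mixture amplitudes a_r        (hypothetical parameter names)
    lam : (R,) decay rates lambda_r > 0
    """
    delta = np.arange(M)
    # Kernel values K(0), ..., K(M-1): O(M*R) instead of O(M^2) parameters.
    K = (a[:, None] * np.exp(-lam[:, None] * delta)).sum(axis=0)
    i, j = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    # Entry (i, j) depends only on i - j (Toeplitz); zero above the diagonal.
    B = np.where(i >= j, K[np.clip(i - j, 0, M - 1)], 0.0)
    return B
```

Because the kernel is defined for any $\delta$, the same fitted $(a_r, \lambda_r)$ can be evaluated at block distances beyond those seen in training, which is what permits length extrapolation.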

Resources