Toeplitz Block Mixing for Scalable Multi-Head Linear Attention
Abstract
Linear attention offers complexity for sequence modeling but struggles with associative recall tasks due to compressing all past information into a fixed-size summary. Multi-Head Linear Attention (MHLA) addresses this by learning a block mixing matrix that allows different query blocks to attend to different mixtures of past summaries, but introduces complexity in the number of blocks and cannot extrapolate to longer sequences. We analyze the mixing patterns learned by MHLA and discover they are approximately translation-invariant: fitting to a distance-tied kernel yields across all layers. Motivated by this finding, we propose Toeplitz Block Mixing (TBM), which parameterizes the mixing kernel as a mixture of exponentials . This reduces complexity from to and enables length extrapolation. On associative recall tasks, TBM achieves 7.3 higher accuracy than Dense MHLA (1.25% vs 0.17%) with 1.24 throughput improvement, and successfully extrapolates to 8 longer sequences.