Exponential Integrator for Diagonal-Decay Delta Attention: A Negative Result on Length Extrapolation
Abstract
Linear attention mechanisms such as diagonal-decay delta attention (KDA) use Euler discretization with L2 normalization on keys and queries, which may discard useful key-norm information that could serve as a signal-strength channel for improved length extrapolation. We propose replacing the Euler coefficient with an exact exponential integrator derived from the continuous-time dynamics, whose bounded coefficients enable stable training without L2 normalization. Experiments on three synthetic long-context tasks (Palindrome, MQAR, Stack) show that the exponential integrator achieves numerical stability without normalization (zero NaN or divergence failures across 27 runs). However, it does not improve accuracy at length extrapolation: the proposed method underperforms the baseline on Palindrome ( pp) and Stack ( pp) at 4× extrapolation. Ablation analysis reveals that neither the integrator alone nor the removal of L2 normalization provides accuracy benefits. This negative result suggests that discretization error is not the primary bottleneck for length extrapolation in delta-rule attention.
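The contrast between the two discretizations can be illustrated on a scalar decay ODE. The sketch below is a minimal, hypothetical example (not the paper's implementation): for dx/dt = -a*x + u, the Euler coefficient (1 - a*dt) can leave the interval (-1, 1) and destabilize the recurrence, while the exponential-integrator coefficient exp(-a*dt) is bounded in (0, 1] for any a > 0 and step size.

```python
import numpy as np

# Illustrative sketch: one diagonal channel of a decaying linear-attention
# state, modeled as the scalar ODE dx/dt = -a*x + u. All names and values
# here are assumptions for demonstration, not the paper's code.

def euler_step(x, a, u, dt):
    # Euler coefficient (1 - a*dt) drops below -1 once a*dt > 2,
    # so the iteration diverges for large decay rates or step sizes.
    return (1.0 - a * dt) * x + dt * u

def exp_integrator_step(x, a, u, dt):
    # Exact solution of the ODE over one step: the coefficient
    # exp(-a*dt) stays in (0, 1] for a > 0, so the update is
    # unconditionally stable.
    decay = np.exp(-a * dt)
    return decay * x + (1.0 - decay) / a * u

x_euler = x_exp = 1.0
a, u, dt = 5.0, 0.5, 0.5  # a*dt = 2.5: Euler coefficient is -1.5
for _ in range(20):
    x_euler = euler_step(x_euler, a, u, dt)
    x_exp = exp_integrator_step(x_exp, a, u, dt)

# Euler blows up; the exponential integrator converges to the
# fixed point u / a = 0.1.
print(abs(x_exp - u / a) < 1e-6, abs(x_euler) > 100.0)
```

In this regime the Euler iterate oscillates with growing magnitude while the exponential-integrator iterate converges monotonically, which is the stability property the abstract attributes to bounded coefficients.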