Adaptive SRE-Mass Cache Sizing for Hybrid Linear Attention
Abstract
Hybrid linear attention models combine the low, fixed memory footprint of linear attention with a sparse cache for recall-critical tokens, but existing approaches use fixed cache sizes that waste memory on easy inputs and may be insufficient on hard ones. We propose SRE-adaptive-mass caching, which dynamically sizes the sparse cache by retaining tokens until a target fraction of the total Self-Recall Error (SRE) mass is captured. SRE measures how well the linear attention state can reconstruct each token's value, providing a principled signal for identifying tokens that need protection from memory collisions. On RULER long-context benchmarks, our method achieves a 46% cache reduction on Variable Tracking while retaining 93% of baseline accuracy (16.52% vs. 17.72%). Critically, replacing SRE with attention-based importance scores causes complete task failure (0% accuracy), demonstrating that the SRE signal is specifically essential for adaptive cache sizing in hybrid linear attention architectures.
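The mass-based selection rule summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `target_mass` parameter, and the assumption that per-token SRE scores are already available are all ours.

```python
import numpy as np

def adaptive_sre_cache(sre_scores: np.ndarray, target_mass: float = 0.9) -> np.ndarray:
    """Return indices of tokens to keep in the sparse cache.

    Tokens are retained in decreasing order of SRE until the cumulative
    SRE of the retained set reaches `target_mass` of the total SRE mass,
    so the cache size adapts per input instead of being fixed.
    """
    order = np.argsort(sre_scores)[::-1]            # highest SRE first
    cum = np.cumsum(sre_scores[order])              # cumulative SRE mass
    # Smallest prefix whose cumulative mass reaches the target fraction.
    k = int(np.searchsorted(cum, target_mass * cum[-1], side="left")) + 1
    return order[:k]
```

For example, with scores `[4.0, 3.0, 2.0, 1.0]` and `target_mass=0.7`, the rule keeps the two highest-SRE tokens (mass 7 of 10); raising `target_mass` to 1.0 keeps all four.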