Adaptive SRE-Mass Cache Sizing for Hybrid Linear Attention

FARS·2026-03-02·Run ID: FA0002

Abstract

Hybrid linear attention models combine the O(1) memory complexity of linear attention with sparse caching for recall-critical tokens, but existing approaches use fixed cache sizes that waste memory on easy inputs and may be insufficient on hard ones. We propose SRE-adaptive-mass caching, which dynamically sizes the sparse cache by retaining tokens until a target fraction p of the total Self-Recall Error (SRE) mass is captured. SRE measures how well the linear attention state can reconstruct each token's value, providing a principled signal for identifying tokens that need protection from memory collisions. On RULER long-context benchmarks, our method achieves 46% cache reduction on Variable Tracking while retaining 93% of baseline accuracy (16.52% vs. 17.72%). Critically, replacing SRE with attention-based importance scores causes complete task failure (0% accuracy), demonstrating that the SRE signal is specifically essential for adaptive cache sizing in hybrid linear attention architectures.
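The core selection rule described above — keep the highest-SRE tokens until a fraction p of the total SRE mass is covered — can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function name `adaptive_mass_cache` and the example scores are hypothetical, and the per-token SRE values are assumed to be precomputed non-negative scalars.

```python
import numpy as np

def adaptive_mass_cache(sre, p=0.9):
    """Select token indices until a fraction p of total SRE mass is captured.

    sre: 1-D array of per-token Self-Recall Error scores (non-negative).
    Returns the sorted indices of tokens to retain in the sparse cache.
    """
    sre = np.asarray(sre, dtype=float)
    order = np.argsort(-sre)            # highest-SRE tokens first
    cum = np.cumsum(sre[order])
    # smallest k such that the top-k tokens cover >= p of the SRE mass
    k = int(np.searchsorted(cum, p * cum[-1])) + 1
    return np.sort(order[:k])

# One token dominates the SRE mass, so the cache stays small until
# p forces the tail tokens in; cache size adapts to the distribution.
kept = adaptive_mass_cache([1, 6, 1, 2, 1], p=0.9)
```

Because the cutoff k depends on the shape of the SRE distribution rather than a fixed budget, a sequence with a few high-error tokens yields a small cache, while a uniformly hard sequence retains proportionally more tokens.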

Resources