Sink-Free Attention Enables Prefix-Free Streaming KV Caches
Abstract
Streaming inference with large language models requires a bounded key-value (KV) cache, but vanilla transformers develop ``attention sinks'' (tokens that receive disproportionate attention regardless of semantic relevance), which break pure rolling-window caching. Current solutions retain the sink tokens at the prefix, adding complexity and consuming cache capacity. We investigate whether gated attention, which eliminates attention sinks by gating the output of scaled dot-product attention (SDPA), enables prefix-free streaming. Experiments on Qwen2-1B models show that gated attention achieves near-perfect parity between the pure rolling-window and prefix-sink regimes (perplexity ratio of 1.015, versus 2.54 for the baseline), with a 99.3--100% reduction in attention-sink rate. Full-attention evaluation confirms that these gains are genuine rather than artifacts of model degradation. Our results demonstrate that sink-free attention enables simpler streaming deployment without prefix-token engineering.
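
To make the mechanism concrete, the sketch below shows one way post-SDPA gating can be implemented in PyTorch. The single-head module structure, the projection names (`w_gate`, `o_proj`), and the elementwise sigmoid gate computed from the input hidden state are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Minimal single-head attention with a sigmoid gate applied after SDPA.

    Illustrative sketch only: gate placement (elementwise, after SDPA,
    before the output projection) follows the post-SDPA gating described
    in the abstract; details may differ from the paper's implementation.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.w_gate = nn.Linear(d_model, d_model)  # gate from the input hidden state (assumed)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Standard causal scaled dot-product attention.
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Post-SDPA gate: a token that needs nothing from context can drive
        # its gate toward zero instead of parking attention mass on a sink.
        gate = torch.sigmoid(self.w_gate(x))
        return self.o_proj(gate * attn_out)

# Usage: GatedAttention(64)(torch.randn(2, 16, 64)).shape -> (2, 16, 64)
```

Because softmax attention must distribute probability mass somewhere, vanilla heads route ``no-op'' mass to sink tokens; the gate gives each token an explicit off switch, which is why a pure rolling window can evict the oldest entries without the prefix-sink workaround.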