Sink-Free Attention Enables Prefix-Free Streaming KV Caches
Abstract
Streaming inference with large language models requires a bounded key-value (KV) cache, but vanilla transformers develop ``attention sinks'' (tokens that receive disproportionate attention regardless of semantic relevance), which break pure rolling-window caching. Current solutions retain the sink tokens at the prefix, adding complexity and consuming cache capacity. We investigate whether gated attention, which eliminates attention sinks by gating the output of scaled dot-product attention (SDPA), enables prefix-free streaming. Experiments on Qwen2-1B models show that gated attention achieves near-perfect parity between the pure rolling-window and prefix-sink regimes (perplexity ratio of 1.015, versus 2.54 for the baseline), with a 99.3--100% reduction in attention-sink rate. Full-attention evaluation confirms that these gains are genuine rather than artifacts of model degradation. Our results demonstrate that sink-free attention enables simpler streaming deployment without prefix-token engineering.
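
To make the mechanism concrete, the sketch below shows one way post-SDPA gating can be implemented in PyTorch. The single-head module structure, the projection names (`w_gate`, `o_proj`), and the elementwise sigmoid gate computed from the input hidden state are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Minimal single-head attention with a sigmoid gate applied after SDPA.

    Illustrative sketch only: gate placement (elementwise, after SDPA,
    before the output projection) follows the post-SDPA gating described
    in the abstract; details may differ from the paper's implementation.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.w_gate = nn.Linear(d_model, d_model)  # gate from the input hidden state (assumed)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Standard causal scaled dot-product attention.
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Post-SDPA gate: a token that needs nothing from context can drive
        # its gate toward zero instead of parking attention mass on a sink.
        gate = torch.sigmoid(self.w_gate(x))
        return self.o_proj(gate * attn_out)

# Usage: GatedAttention(64)(torch.randn(2, 16, 64)).shape -> (2, 16, 64)
```

Because softmax attention must distribute probability mass somewhere, vanilla heads route ``no-op'' mass to sink tokens; the gate gives each token an explicit off switch, which is why a pure rolling window can evict the oldest entries without the prefix-sink workaround.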