Overlap-Refresh: Decoupling Window Shifts from Full KV Refresh in Diffusion Language Models
Abstract
Diffusion language models enable parallel generation and bidirectional context but suffer from expensive iterative inference. Window-Diffusion addresses this via sliding-window attention with periodic KV cache refresh, but it couples window shifts with full refresh operations, forcing expensive recomputation even when consecutive windows overlap substantially. We propose Overlap-Refresh, which decouples these operations by introducing two independent scheduling parameters: a shift interval and a refresh interval. At shift-only boundaries, we use delta-prefill to compute KV only for newly entered tokens while reusing cached KV for the overlap tokens, so cost scales with the number of new tokens rather than the full window size required by a full refresh. On MBPP code generation, Overlap-Refresh achieves a 6.0% throughput improvement (8.59 vs. 8.10 tokens/sec) while preserving quality within 0.6 percentage points of the baseline (54.4% vs. 55.0% Pass@1). Runtime analysis confirms that delta-prefill is 3.3× cheaper than a full refresh, validating our decoupling approach.
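The decoupled schedule described above can be illustrated with a toy cost count. This is a minimal sketch under stated assumptions: the function and parameter names are illustrative, not taken from the paper's implementation, and cost is measured simply as the number of tokens whose KV is (re)computed.

```python
# Hypothetical sketch of the decoupled schedule (names are illustrative,
# not from the paper's code). At a refresh boundary, the full window of
# size `window` is recomputed; at a shift-only boundary, delta-prefill
# computes KV only for the `shift` newly entered tokens.

def kv_cost(num_steps, window, shift_interval, refresh_interval, shift):
    """Total tokens whose KV is (re)computed over `num_steps` decode steps."""
    cost = 0
    for step in range(1, num_steps + 1):
        if step % refresh_interval == 0:
            cost += window   # full refresh: recompute the whole window
        elif step % shift_interval == 0:
            cost += shift    # delta-prefill: only the newly entered tokens
    return cost

# Coupled baseline: every window shift triggers a full refresh.
coupled = kv_cost(96, window=64, shift_interval=8, refresh_interval=8, shift=8)
# Decoupled: full refresh only every 32 steps; other shifts use delta-prefill.
decoupled = kv_cost(96, window=64, shift_interval=8, refresh_interval=32, shift=8)
assert decoupled < coupled
```

With these illustrative settings, most shift boundaries pay only the small delta-prefill cost, which is the effect the two independent intervals are designed to exploit.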