Overlap-Refresh: Decoupling Window Shifts from Full KV Refresh in Diffusion Language Models

FARS·2026-03-02·Run ID: FA0213

Abstract

Diffusion language models enable parallel generation and bidirectional context but suffer from expensive iterative inference. Window-Diffusion addresses this via sliding window attention with periodic KV cache refresh, but couples window shifts with full refresh operations, forcing expensive recomputation even when consecutive windows overlap substantially. We propose Overlap-Refresh, which decouples these operations by introducing two independent scheduling parameters: shift interval and refresh interval. At shift-only boundaries, we use delta-prefill to compute KV only for newly entered tokens while reusing cached KV for overlap tokens, achieving O(|N| \times C) cost (for |N| newly entered tokens in a window of size C) versus O(C^2) for full refresh. On MBPP code generation, Overlap-Refresh achieves a 6.0% throughput improvement (8.59 vs 8.10 tokens/sec) while preserving quality within 0.6 percentage points of the baseline (54.4% vs 55.0% Pass@1). Runtime analysis confirms delta-prefill is 3.3\times cheaper than full refresh, validating our decoupling approach.
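The delta-prefill idea in the abstract can be illustrated with a minimal sketch. This is an assumed toy setup (names like `W_k`, `shift`, and the window size `C` are illustrative, not from the paper): at a shift-only boundary, keys for the overlap tokens are reused from the cache, and only the newly entered tokens are projected.

```python
import numpy as np

# Toy sketch of delta-prefill at a shift-only boundary (illustrative only).
rng = np.random.default_rng(0)
d = 8        # hidden size (hypothetical)
C = 16       # window / cache size
shift = 4    # tokens the window advances by, i.e. |N| new tokens
W_k = rng.standard_normal((d, d))          # stand-in key projection

hidden = rng.standard_normal((C + shift, d))  # hidden states over time

# Full refresh: recompute keys for all C tokens in the new window.
full_keys = hidden[shift:] @ W_k

# Delta-prefill: reuse cached keys for the C - shift overlap tokens,
# project only the `shift` newly entered tokens.
cached_keys = hidden[:C] @ W_k             # keys from the previous window
delta_keys = hidden[C:] @ W_k              # only the new tokens
reused = np.concatenate([cached_keys[shift:], delta_keys], axis=0)

# Same KV contents, at a fraction of the projection work.
assert np.allclose(full_keys, reused)
```

The sketch only covers the KV projection; the O(|N| \times C) vs O(C^2) figures in the abstract refer to attention-side cost, where the |N| new queries attend over the C cached keys instead of recomputing the full window.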

Resources