Prefill Twice, Decode Once: Exploiting KV Cache Redundancy in Prompt Repetition
Abstract
Prompt repetition is a simple technique that improves LLM accuracy by duplicating the input prompt, but it doubles the KV cache memory, limiting practical deployment. We observe that the first copy's KV cache may be redundant at decode time: during prefill, the second copy's representations are computed with full attention to the first copy, so they potentially already encode all necessary information. We propose \textbf{Prefill Twice, Decode Once (PTDO)}, which prefills the duplicated prompt but retains only the second copy's KV cache for decoding, preserving the RoPE position offsets assigned during prefill. PTDO requires no model modifications or training. Experiments on Llama-3.1-8B and Qwen2.5-7B over the NameIndex and ARC-Challenge benchmarks demonstrate that PTDO matches or exceeds the accuracy of full prompt repetition (100%+ accuracy retention) while reducing decode-time KV cache by approximately 50%. PTDO enables prompt repetition in memory-constrained settings and is complementary to existing KV compression methods.
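The core cache-pruning step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a flat per-layer cache of shape (sequence, head_dim), and the helper name `ptdo_prune_kv` is hypothetical. The key point it demonstrates is that the retained entries keep their original prefill positions, so RoPE offsets stay consistent when decoding resumes.

```python
import numpy as np

def ptdo_prune_kv(kv_cache: np.ndarray, prompt_len: int):
    """Keep only the second-copy KV entries after prefilling the
    duplicated prompt (cache length == 2 * prompt_len).

    The kept entries retain their original RoPE positions
    [prompt_len, 2 * prompt_len), so decoding continues at
    position 2 * prompt_len exactly as it would with the full cache.
    """
    second_copy = kv_cache[prompt_len:]                # drop first-copy rows
    positions = np.arange(prompt_len, 2 * prompt_len)  # positions baked in at prefill
    return second_copy, positions

# Toy example: a 4-token prompt duplicated into an 8-entry cache.
cache = np.random.randn(8, 64)
kept, pos = ptdo_prune_kv(cache, prompt_len=4)
print(kept.shape, pos.tolist())  # (4, 64) [4, 5, 6, 7]
```

In a real serving stack this slice would be applied per layer and per head after prefill and before the first decode step, halving the resident cache for the rest of generation.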