Delta-Prefill Switching: Adaptive Routing for Speculative Decoding in Multi-Turn LLM Serving
Abstract
Multi-turn LLM applications with prefix caching are increasingly common in production deployments. Speculative decoding accelerates inference by using a draft model to propose tokens that are verified in parallel, but its serialization requirement creates a severe bottleneck under concurrent multi-tenant load. We propose Delta-Prefill Switching (DPS), a simple routing policy that uses incremental prompt growth (Δ)---the new tokens added since the last turn---to route requests between speculative and greedy decoding servers. When Δ is small, cached computation dominates and speculation provides speedup; when Δ is large, speculation's serialization becomes costly under concurrency. On the ToolBench and BFCL benchmarks, DPS achieves a 21--22% speedup over greedy decoding in sequential mode, matching always-on speculation. Under concurrent load, DPS achieves a 64--80% speedup over always-on speculation by routing to the concurrent-capable greedy server. DPS is robust to threshold selection and requires no model modifications.
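The routing policy described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the threshold value, and the server labels are assumptions, and the abstract does not specify the threshold used in the experiments.

```python
def route_request(delta_tokens: int, threshold: int = 256) -> str:
    """Route one turn based on incremental prompt growth (delta).

    delta_tokens: new prompt tokens added since the previous turn.
    threshold: switching point (hypothetical value; the paper reports
               robustness to this choice but the abstract gives no number).
    """
    if delta_tokens <= threshold:
        # Small delta: the prefix cache covers most of the prompt, so
        # speculation's serialized verification is cheap relative to
        # the saved decode steps.
        return "speculative"
    # Large delta: prefill work dominates and speculation's
    # serialization hurts under concurrency; use the greedy server.
    return "greedy"
```

For example, a follow-up question that appends 40 tokens to a cached conversation would route to the speculative server, while a turn that injects a long tool-call transcript would route to the greedy one.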