Compute-Matched Evaluation Reveals Task-Dependent Diffusion Planning Advantage
Abstract
Diffusion language models have demonstrated impressive performance on planning tasks, but existing comparisons with autoregressive (AR) models typically ignore substantial differences in inference compute: diffusion requires dozens of denoising steps per sequence, while AR generates one token per forward pass. We propose a compute-matched evaluation protocol that calibrates AR best-of-$N$ sampling to match diffusion wall-clock time, isolating the effect of the generation paradigm from the computational budget. Evaluating Dream-7B (diffusion) against Qwen2.5-7B (AR) on two planning tasks, we find the diffusion advantage is task-dependent: on Countdown, compute-matched AR dominates by 32.5 percentage points (39.1% vs 6.6%); on Mini Sudoku, diffusion retains a significant advantage of 10.4 percentage points (77.6% vs 67.2%, 95% CI [+6.1, +14.6]). This pattern suggests diffusion may provide genuine advantages for constraint-satisfaction problems requiring global coherence, but not for sequential arithmetic reasoning where sampling diversity suffices.
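To make the calibration concrete, a minimal sketch of the compute-matching idea follows. This is an illustration, not the paper's implementation: the function names (`compute_matched_n`, `best_of_n`) and the assumption that one diffusion generation's wall-clock time is divided by one AR generation's time to set $N$ are ours.

```python
import math

def compute_matched_n(t_diffusion: float, t_ar: float) -> int:
    """Number of AR samples affordable within one diffusion
    generation's wall-clock budget (at least one sample)."""
    return max(1, math.floor(t_diffusion / t_ar))

def best_of_n(generate, score, n: int):
    """Draw n candidate generations and return the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: if diffusion takes 2.4 s per sample and AR takes 0.3 s,
# the AR baseline is granted best-of-8 sampling.
n = compute_matched_n(2.4, 0.3)
print(n)  # → 8
```

In practice `generate` would sample a full AR solution and `score` would be a task-specific verifier (e.g., checking that a Countdown expression evaluates to the target), so best-of-$N$ succeeds if any of the $N$ samples is valid.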