Syntax Constraints Are Not Enough: Semantic Errors Dominate Diffusion LM Tool-Calling Failures

FARS·2026-03-02·Run ID: FA0297

Abstract

Diffusion language models have emerged as a promising alternative to autoregressive generation, yet they substantially underperform on structured output tasks such as tool calling. A common hypothesis attributes this gap to formatting failures that constrained decoding could eliminate. We evaluate this hypothesis systematically by applying CFG-constrained decoding to LLaDA-8B on the BFCL-v3 benchmark. While grammar constraints reduce parse failures by 60% (from 6.76% to 2.67%) and raise the AST parse rate to 96.67%, overall success improves by only 0.57 percentage points (36.19% → 36.76%). Our error taxonomy reveals that semantic errors, namely selecting the wrong function or supplying incorrect arguments, account for approximately 60% of all failures and remain unaffected by syntax-level interventions. The persistent 50.74 percentage point gap relative to autoregressive models of similar scale demonstrates that syntax constraints alone are insufficient: achieving competitive tool-calling performance requires addressing deeper semantic deficiencies in diffusion language models.
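To make the intervention concrete, the sketch below illustrates the general idea of grammar-constrained decoding: at each step, tokens that cannot extend a grammatical prefix are masked out before selection. This is a minimal toy illustration, not the paper's implementation; the character vocabulary, the hand-rolled `allowed_next` automaton, and the single-rule grammar `name(args)` are all simplifying assumptions for exposition.

```python
import re

# Toy grammar: a function call of the form  name '(' args ')'  over a tiny
# character vocabulary. Real systems constrain token logits against a full CFG.
CALL = re.compile(r"[a-z]+\([a-z]*\)")

def allowed_next(prefix: str) -> set[str]:
    """Characters the toy grammar permits immediately after `prefix`."""
    if "(" not in prefix:
        allowed = set("abc")        # still reading the function name
        if prefix:                  # '(' is legal only once the name is nonempty
            allowed.add("(")
        return allowed
    if ")" in prefix:
        return set()                # call is complete; nothing may follow
    return set("abc") | {")"}       # reading the (possibly empty) argument

def constrained_decode(logits_per_step: list[dict[str, float]]) -> str:
    """Greedy decoding with grammar masking: among grammar-legal characters,
    pick the one the model scores highest at each step."""
    out = ""
    for logits in logits_per_step:
        legal = allowed_next(out)
        if not legal:
            break
        out += max(legal, key=lambda ch: logits[ch])
    return out

# Mock per-step scores: unconstrained greedy decoding would emit ')' first,
# an unparseable output; the mask forces a grammatical call instead.
steps = [
    {")": 5.0, "(": 4.0, "a": 2.0, "b": 1.0, "c": 0.0},
    {")": 9.0, "(": 5.0, "a": 1.0, "b": 0.0, "c": 0.0},
    {")": 7.0, "(": 3.0, "a": 1.0, "b": 0.0, "c": 0.0},
]
result = constrained_decode(steps)
assert CALL.fullmatch(result)  # masking guarantees a parseable call
```

The point of the abstract is visible even in this toy: the mask guarantees the output *parses*, but nothing in it guarantees the model picked the *right* function or arguments, which is exactly the class of semantic error that dominates the remaining failures.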

Resources