Custom Forward-Backward VJPs for DFA-Guided Diffusion Language Models: An Empirical Study

FARS·2026-03-02·Run ID: FA0138

Abstract

DFA-guided diffusion language models enable constrained text generation by steering denoising with gradients of DFA acceptance probability. However, the DFA dynamic programming computation accounts for 57--59% of each guided step's runtime, creating a significant bottleneck. We implement custom forward-backward vector-Jacobian products (VJPs) that analytically compute gradients without autograd tape storage, using Triton kernels and pre-allocated buffers. Our approach produces gradients that match baseline autograd (cosine similarity 1.0, relative L2 error $1.7 \times 10^{-5}$). However, we achieve only a 1.01--1.23$\times$ speedup over \texttt{torch.compile}, far below our $3\times$ target. The root cause is that tokenizer-aligned DFAs are inherently dense (50--6,177 edges per state pair), invalidating sparse optimization approaches. We document this negative result to inform future work: accelerating DFA-guided diffusion likely requires alternative approaches, such as state-space reduction or approximate inference, rather than gradient-computation optimizations.
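The forward-backward VJP the abstract refers to can be illustrated with a minimal sketch. The toy DFA, vocabulary size, and all function names below are hypothetical, and plain NumPy stands in for the paper's Triton kernels: acceptance probability is computed by a forward DP over soft token distributions, and its gradient is recovered analytically from forward messages $\alpha_t$ and backward messages $\beta_t$, with no autograd tape.

```python
import numpy as np

# Hypothetical toy setup: a 3-state DFA over a 4-token vocabulary,
# read over T = 5 soft tokens. delta[s, v] = next state; state 2 accepts.
S, V, T = 3, 4, 5
rng = np.random.default_rng(0)
delta = rng.integers(0, S, size=(S, V))
accept = np.zeros(S)
accept[2] = 1.0

# One-hot transition tensor: A[s, v, s'] = 1 iff delta(s, v) = s'.
A = np.zeros((S, V, S))
for s in range(S):
    for v in range(V):
        A[s, v, delta[s, v]] = 1.0

def forward_backward_vjp(P):
    """Acceptance probability of soft token sequence P (shape (T, V))
    under the DFA, plus its analytic gradient d acc / d P computed by
    a forward-backward recursion instead of autograd tape replay."""
    # Forward: alpha_t[s] = prob. of being in state s after t soft tokens.
    alphas = [np.eye(S)[0]]                       # start in state 0
    for t in range(T):
        M = np.einsum('v,svu->su', P[t], A)       # soft transition matrix
        alphas.append(alphas[-1] @ M)
    acc = alphas[-1] @ accept
    # Backward: beta_t[s] = acceptance prob. from state s at step t.
    # Since acc is linear in each P[t], the gradient is alpha_t x beta_{t+1}
    # contracted through the transition tensor.
    beta = accept.copy()
    grad = np.zeros_like(P)
    for t in reversed(range(T)):
        grad[t] = np.einsum('s,svu,u->v', alphas[t], A, beta)
        M = np.einsum('v,svu->su', P[t], A)
        beta = M @ beta
    return acc, grad

# Soft token distributions, e.g. a denoiser's per-position posteriors.
P = rng.dirichlet(np.ones(V), size=T)
acc, grad = forward_backward_vjp(P)
```

Because each `P[t]` is a distribution and the transition tensor is row-stochastic in the next-state index, `acc` stays in $[0, 1]$; the analytic `grad` can be validated against finite differences, mirroring the cosine-similarity check reported in the abstract.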

Resources