AR-Order RL Post-Training Reduces Order Robustness in Diffusion Language Models
Abstract
Diffusion language models (dLLMs) offer a unique advantage over autoregressive models: order robustness, the ability to solve reasoning problems regardless of whether the reasoning precedes or follows the answer. However, the dominant approach to improving dLLM performance is reinforcement learning with autoregressive-order (AR-order) reward signals, such as JustGRPO. We investigate whether this training paradigm compromises order robustness. Comparing LLaDA-8B-Instruct (diffusion base), LLaDA-Instruct-JustGRPO (AR-order RL trained), and Qwen2.5-7B-Instruct (AR anchor) on ReasonOrderQA and GSM8K, we find that JustGRPO significantly reduces order robustness: the robustness ratio drops by 0.192 on ReasonOrderQA and by 0.138 on GSM8K. JustGRPO's robustness profile falls between the diffusion base and the AR anchor, closing approximately 53% of the gap toward the AR anchor. The degradation is concentrated at medium difficulty levels that require multi-step reasoning. While AR-order RL improves CoT-First accuracy by up to 19.6 percentage points, the accompanying loss of order robustness reveals a fundamental accuracy-robustness trade-off in dLLM post-training.