Does iGRPO Need a Good Draft? Best-vs-Worst Self-Conditioning Ablation for RLVR Math
Abstract
Iterative Group Relative Policy Optimization (iGRPO) improves mathematical reasoning in large language models by conditioning refinement training on self-generated drafts selected by reward. However, it remains unclear whether iGRPO's benefit stems from conditioning on high-quality drafts or from the two-stage structure itself. We design a controlled ablation comparing three conditions: a GRPO baseline, iGRPO with best-of-N draft selection, and iGRPO with worst-of-formatted draft selection, which intentionally conditions on low-quality but well-formatted drafts. Surprisingly, worst-of-formatted selection not only recovers best-of-N performance but \emph{exceeds} it, achieving 64.37% versus 61.94% macro-average accuracy across six math benchmarks. A recovery ratio of 1.34 (95% CI: [1.21, 1.47]) on MATH500 demonstrates that high draft quality is not necessary for iGRPO's benefit. Analysis reveals that worst-of-formatted selection produces 50% more gradient-active training groups (groups whose within-group rewards are not uniform and thus contribute nonzero policy gradient), which may explain its superior performance. These findings suggest that iGRPO can be simplified by removing reward-based draft selection.
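To make the compared conditions concrete, the following is a minimal Python sketch of the two draft-selection rules and of the gradient-activity criterion implied by GRPO's group-relative advantage. The names (`Draft`, `is_formatted`, the `None` fallback) are illustrative assumptions, not the paper's implementation.

from typing import Callable, List, Optional, Tuple

# A draft paired with its scalar reward (e.g., a verifier score). Illustrative type.
Draft = Tuple[str, float]

def select_best_of_n(drafts: List[Draft]) -> Draft:
    """iGRPO's default conditioning: keep the highest-reward draft."""
    return max(drafts, key=lambda d: d[1])

def select_worst_of_formatted(
    drafts: List[Draft], is_formatted: Callable[[str], bool]
) -> Optional[Draft]:
    """Ablation condition: keep the lowest-reward draft that still passes
    a format check (assumed check, e.g., the final answer parses)."""
    formatted = [d for d in drafts if is_formatted(d[0])]
    if not formatted:
        return None  # assumed fallback when no draft is well-formatted
    return min(formatted, key=lambda d: d[1])

def is_gradient_active(group_rewards: List[float]) -> bool:
    """A GRPO group whose rewards are all equal has zero group-relative
    advantage and contributes no policy gradient; a group is
    'gradient-active' only if its rewards vary."""
    return max(group_rewards) > min(group_rewards)

Under this reading, conditioning on a weak draft leaves more room for the refinement rollouts within a group to disagree in reward, which would raise the fraction of gradient-active groups; the sketch fixes terminology only, while the mechanism itself remains the paper's hypothesis.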