Compute-Matched Evaluation of Transform-Augmented GRPO for Mathematical Reasoning
Abstract
Transform-Augmented GRPO (TA-GRPO) improves mathematical reasoning by generating semantic transformations of training prompts and pooling advantages across variants. However, prior comparisons with standard GRPO are confounded by compute differences: TA-GRPO uses more rollouts per original prompt. We present a compute-matched evaluation in which both methods consume identical total rollouts (725K). Under this fair comparison, TA-GRPO achieves +2.02 percentage points higher Pass@32 than GRPO-Long (49.47% vs 47.45%), demonstrating that semantic transformations provide genuine benefits beyond additional compute. Ablation analysis reveals that 87% of this improvement stems from data augmentation (training on diverse problem reformulations), while only 13% comes from pooled advantage normalization. The advantage grows with inference-time compute (from +1.07pp at Pass@1 to +2.02pp at Pass@32), consistent with improved solution diversity.
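To make the pooling distinction concrete, the sketch below contrasts standard GRPO advantage normalization (per-prompt rollout group) with the pooled variant the abstract describes, where rollouts from all semantic transformations of one original prompt share a single normalization group. This is an illustrative reconstruction, not the paper's implementation: function names, the epsilon term, and the exact normalization (mean-zero, unit-std) are assumptions.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    # Standard GRPO (assumed form): normalize rollout rewards
    # within the group sampled for a single prompt.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def ta_grpo_advantages(rewards_by_variant, eps=1e-8):
    # Pooled normalization (assumed form): concatenate rollouts
    # from every transformed variant of one original prompt,
    # normalize over the pooled set, then split back per variant.
    pooled = np.concatenate(
        [np.asarray(r, dtype=float) for r in rewards_by_variant]
    )
    adv = (pooled - pooled.mean()) / (pooled.std() + eps)
    sizes = [len(r) for r in rewards_by_variant]
    return np.split(adv, np.cumsum(sizes)[:-1])
```

Under pooling, a variant whose rewards sit above the pooled mean receives uniformly positive advantages, so relative difficulty across reformulations influences the update, whereas per-prompt normalization discards that signal.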