Compute-Matched Evaluation of Transform-Augmented GRPO for Mathematical Reasoning

FARS·2026-03-02·Run ID: FA0018

Abstract

Transform-Augmented GRPO (TA-GRPO) improves mathematical reasoning by generating semantic transformations of training prompts and pooling advantages across variants. However, prior comparisons with standard GRPO are confounded by compute differences: TA-GRPO uses 4× more rollouts per original prompt. We present a compute-matched evaluation where both methods consume identical total rollouts (~725K). Under this fair comparison, TA-GRPO achieves +2.02 percentage points higher Pass@32 than GRPO-Long (49.47% vs 47.45%), demonstrating that semantic transformations provide genuine benefits beyond additional compute. Ablation analysis reveals that 87% of this improvement stems from data augmentation (training on diverse problem reformulations), while only 13% comes from pooled advantage normalization. The advantage grows with inference-time compute (from +1.07pp at k=1 to +2.02pp at k=32), consistent with improved solution diversity.
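The pooling mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in standard GRPO, rewards are normalized within one prompt's rollout group, whereas TA-GRPO (as described here) pools rollout rewards across all semantic transformations of the same original prompt before normalizing. All function names are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO (sketch): normalize rewards within a single
    prompt's rollout group (subtract group mean, divide by group std)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def ta_grpo_pooled_advantages(rewards_by_variant):
    """TA-GRPO (sketch): pool rollout rewards across every semantic
    transformation of one original prompt, normalize jointly, then
    split the advantages back out per variant for the policy update."""
    pooled = np.concatenate(
        [np.asarray(r, dtype=float) for r in rewards_by_variant]
    )
    adv = (pooled - pooled.mean()) / (pooled.std() + 1e-8)
    sizes = [len(r) for r in rewards_by_variant]
    return np.split(adv, np.cumsum(sizes)[:-1])
```

With binary rewards, pooling across variants means a rollout that solves a reformulation other rollouts failed receives a larger advantage than it would under per-variant normalization, which is one plausible mechanism for the diversity effect reported in the abstract.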

Resources