R-MEL: Recovering Contrastive Signal from All-Negative Groups via Prefix-Primed Revision
Abstract
Reinforcement learning with verifiable rewards (RLVR) has enabled significant advances in LLM reasoning through contrastive learning over groups of sampled trajectories. However, when every trajectory in a group fails verification (an ``all-negative'' group), no contrastive pair exists and the group is discarded, wasting approximately 30% of training compute. We observe that failed trajectories often contain correct reasoning prefixes before diverging into errors. We propose R-MEL (Revision-Augmented Meta-Experience Learning), which recovers contrastive signal from all-negative groups by truncating failed trajectories at candidate bifurcation points and generating correct continuations from the retained prefixes. On mathematical reasoning benchmarks, R-MEL achieves an average Pass@1 of 33.17 and the highest Avg@8 (30.78) across all conditions, outperforming baselines on 3 of 5 benchmarks, including a statistically significant improvement of +4.0 percentage points on MATH-500. Analysis reveals an inverted-U pattern in revision effectiveness: intermediate-difficulty prompts show the highest revision success rate (15.4%), indicating when prefix-primed revision is most beneficial.
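The core recovery loop described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the helper names (`verify`, `continue_from`), the use of newline-delimited step boundaries as candidate bifurcation points, and the latest-first truncation order are all illustrative assumptions.

```python
def recover_contrastive_pairs(failed_trajectories, verify, continue_from,
                              max_revisions=4):
    """Recover contrastive signal from an all-negative group.

    For each failed trajectory, truncate at candidate bifurcation points
    (here: step boundaries, scanned from the end so the longest
    plausibly-correct prefix is kept) and regenerate a continuation from
    the retained prefix. If a continuation verifies, the pair
    (failed trajectory, revised trajectory) supplies a contrastive signal.
    """
    pairs = []
    for traj in failed_trajectories:
        steps = traj.split("\n")
        budget = max_revisions
        # Candidate bifurcation points: step boundaries, latest first.
        for cut in range(len(steps) - 1, 0, -1):
            prefix = "\n".join(steps[:cut])
            revised = continue_from(prefix)  # prefix-primed revision
            if verify(revised):
                pairs.append((traj, revised))
                break
            budget -= 1
            if budget == 0:
                break  # cap revision compute per trajectory
    return pairs


# Toy demonstration with stub model/verifier: the correct answer is 42.
def verify(trajectory):
    return trajectory.strip().endswith("answer: 42")

def continue_from(prefix):
    # Stub continuation; a real system would sample from the policy model.
    return prefix + "\nanswer: 42"

failed = ["step 1: compute 6*7\nstep 2: 6*7 = 36\nanswer: 36"]
pairs = recover_contrastive_pairs(failed, verify, continue_from)
```

The revised trajectory shares a prefix with the failed one but ends in a verified answer, so the two form the positive/negative pair that standard group-contrastive RLVR updates require.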