Prefix-Ratio GRPO: Improving Gradient Quality for Reinforcement Learning with Verifiable Rewards
Abstract
Distributed reinforcement learning (RL) systems for large language models (LLMs) decouple rollout generation from learning, introducing staleness: trajectories are generated by behavior policies that lag behind the current learner. Standard importance-sampling corrections apply per-token ratios that treat each token independently, ignoring the sequential dependencies of autoregressive generation. We propose Prefix-Ratio GRPO, which incorporates prefix information into the importance ratios: if any prefix token has become unlikely under the current policy, all subsequent tokens are downweighted. Our prefix-aware ratio $\tilde{r}_t = r_t \cdot \min\bigl(1, \min_{s<t} r_s\bigr)$, where $r_t = \pi_\theta(y_t \mid y_{<t}) / \pi_\mu(y_t \mid y_{<t})$ is the per-token ratio between the current policy $\pi_\theta$ and the behavior policy $\pi_\mu$, selectively dampens gradients from tokens following bad prefixes while preserving gradients from tokens following good prefixes. On AIME24 with stale rollouts, Prefix-Ratio GRPO achieves 0.500 avg@64, outperforming vanilla GRPO (0.400) by 10 percentage points. A selectivity analysis shows our method achieves a selectivity ratio of 4.42, dampening 99.4% of bad-prefix tokens while dampening only 22.6% of good-prefix tokens.