Prefix-Ratio GRPO: Improving Gradient Quality for Reinforcement Learning with Verifiable Rewards
Abstract
Distributed reinforcement learning (RL) systems for large language models (LLMs) decouple rollout generation from learning, introducing staleness: trajectories are generated by behavior policies that lag behind the current learner. Standard importance-sampling corrections apply per-token ratios that treat each token independently, ignoring the sequential dependencies of autoregressive generation. We propose Prefix-Ratio GRPO, which incorporates prefix information into the importance ratios: if any prefix token has become unlikely under the current policy, all subsequent tokens are downweighted. Our prefix-aware ratio $\tilde{r}_t = r_t \cdot \min\bigl(1, \min_{s<t} r_s\bigr)$, where $r_t = \pi_\theta(y_t \mid y_{<t}) / \pi_\mu(y_t \mid y_{<t})$ is the per-token ratio between the current policy $\pi_\theta$ and the behavior policy $\pi_\mu$, selectively dampens gradients from tokens following bad prefixes while preserving gradients from tokens following good prefixes. On AIME24 with stale rollouts, Prefix-Ratio GRPO achieves 0.500 avg@64, outperforming vanilla GRPO (0.400) by 10 percentage points. A selectivity analysis shows our method achieves a selectivity ratio of 4.42, dampening 99.4% of bad-prefix tokens while dampening only 22.6% of good-prefix tokens.