Acceptance-Controlled MIS-PO: Adaptive Trajectory Filtering for Stable Off-Policy RLVR Training
Abstract
Off-policy reinforcement learning enables high-throughput training of large language models by decoupling rollout generation from gradient updates, but it introduces distribution shift that can destabilize training under high staleness. Existing methods either crash from gradient explosion (fixed-bound filtering) or underperform (variance control). We propose Acceptance-Controlled MIS-PO (AC-MIS-PO), which adapts trajectory-filtering bounds using a quantile-based controller that targets a pre-specified acceptance-rate schedule. The controller uses exponential-moving-average smoothing to discover appropriate bound magnitudes automatically, without manual tuning. On mathematical reasoning benchmarks under high staleness, AC-MIS-PO achieves 32.57% average accuracy across Math500/AIME24/AIME25, outperforming Fixed MIS-PO (18.67%), M2PO (18.40%), and GRPO (6.57%) while maintaining stable training. Ablation studies reveal that bound magnitude is the primary driver of improvement (+12.76pp from tighter bounds alone), with adaptive control providing automatic discovery of good settings.
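The quantile-based acceptance controller described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the choice of per-trajectory score (here, the magnitude of the log importance ratio), the `ema_beta` smoothing constant, and the function names are assumptions.

```python
import numpy as np

def update_bound(scores, target_accept, prev_bound, ema_beta=0.9):
    """Quantile-based filtering-bound update with EMA smoothing (sketch).

    scores: per-trajectory staleness scores, e.g. |log importance ratio|;
    a trajectory is accepted when its score <= bound.
    target_accept: desired acceptance rate in (0, 1), possibly scheduled.
    """
    # Bound at the target-acceptance quantile: accepting scores below this
    # value yields roughly the target acceptance rate on the current batch.
    raw_bound = np.quantile(scores, target_accept)
    # EMA smoothing stabilizes the bound against batch-to-batch noise,
    # letting the controller discover a bound magnitude without tuning.
    return ema_beta * prev_bound + (1.0 - ema_beta) * raw_bound

def filter_trajectories(scores, bound):
    # Boolean acceptance mask over the batch of trajectories.
    return scores <= bound
```

In this formulation the bound chases whatever magnitude keeps the empirical acceptance rate near the scheduled target, which is how the method avoids hand-picking a fixed filtering threshold.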