Does MIS-PO Need Ratio-Based Trajectory Selection? A Random-Rejection Mechanism Test
Abstract
Off-policy reinforcement learning for LLMs uses importance sampling with clipping to handle distribution shift from stale rollouts. MIS-PO extends this with trajectory-level filtering based on geometric-mean importance ratios, claiming stability benefits. We ask: is trajectory-level filtering necessary, and does the ratio-based criterion matter? We design a three-condition experiment: MIS-PO (full method), TokenOnly (token-level filtering only), and RandomTraj (random trajectory rejection matched to MIS-PO's acceptance rate). On MATH-500 with stale rollouts, TokenOnly (59.25%) dramatically outperforms MIS-PO (2.85%) by 56.4 percentage points. RandomTraj (40.85%) outperforms MIS-PO by 38.0pp despite an identical acceptance rate, demonstrating that random selection achieves 14× the accuracy of ratio-based selection. Analysis reveals that MIS-PO's narrow bounds systematically retain the trajectories closest to the reference policy, which carry minimal learning signal. Token-level importance weighting alone suffices and approaches published GRPO baselines.