Does MIS-PO Need Ratio-Based Trajectory Selection? A Random-Rejection Mechanism Test
Abstract
Off-policy reinforcement learning for LLMs uses importance sampling with clipping to handle distribution shift from stale rollouts. MIS-PO extends this with trajectory-level filtering based on geometric-mean importance ratios, claiming stability benefits. We ask: is trajectory-level filtering necessary, and does the ratio-based criterion matter? We design a three-condition experiment: MIS-PO (full method), TokenOnly (token-level filtering only), and RandomTraj (random trajectory rejection matched to MIS-PO's acceptance rate). On MATH-500 with stale rollouts, TokenOnly (59.25%) dramatically outperforms MIS-PO (2.85%) by 56.4 percentage points. RandomTraj (40.85%) outperforms MIS-PO by 38.0pp despite an identical acceptance rate, demonstrating that random selection achieves 14× the accuracy of ratio-based selection. Analysis reveals that MIS-PO's narrow bounds systematically retain the trajectories closest to the reference policy, which carry minimal learning signal. Token-level importance weighting alone suffices and approaches published GRPO baselines.