Innovation Saturation Does Not Robustify Kalman-Filtered Importance Ratios in LLM Reinforcement Learning
Abstract
Kalman Policy Optimization (KPO) applies causal Kalman filtering to smooth importance-sampling ratios in LLM reinforcement learning, but its performance is sensitive to the process-to-measurement noise ratio: weak smoothing (a large noise ratio) degrades accuracy by 11.79 percentage points on MATH-500. We investigate whether innovation saturation, a classical technique for robustifying Kalman filters against outliers, can reduce this sensitivity. Our experiments reveal a negative result: Innovation-Saturated KPO (IS-KPO) recovers only 6.6% of the performance gap, and the improvement is not statistically significant. Diagnostic analysis shows that the saturation mechanism almost never activates (the clip fraction is near zero) because KPO's large measurement noise yields a clipping threshold far larger than the actual innovations. Attempts to lower the threshold increase the Kalman gain, undermining the smoothing itself. This fundamental design tension, in which activating clipping requires low measurement noise while low measurement noise destroys smoothing, cannot be resolved through parameter tuning, ruling out innovation saturation as a robustification strategy for Kalman-based policy optimization.
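The tension described above can be sketched with a minimal scalar Kalman filter. This is an illustrative toy, not the paper's implementation: the state model, the threshold rule (clipping the innovation at a multiple of its standard deviation, a standard innovation-saturation form), and all parameter values are assumptions for demonstration. It shows why a large measurement noise R makes the saturation threshold so large that clipping never fires, while a small R pushes the Kalman gain toward 1 and eliminates smoothing.

```python
import math

def kf_step(x, P, z, Q, R, c=2.0):
    """One step of a scalar Kalman filter (random-walk state model)
    with innovation saturation: the innovation is clipped at
    c * sqrt(S), where S = P + R is the innovation variance."""
    P = P + Q                        # predict: state variance grows by process noise
    S = P + R                        # innovation variance
    nu = z - x                       # innovation (measurement residual)
    thr = c * math.sqrt(S)           # saturation threshold
    clipped = abs(nu) > thr
    nu = max(-thr, min(thr, nu))     # saturate the innovation
    K = P / S                        # Kalman gain
    x = x + K * nu                   # state update
    P = (1.0 - K) * P                # covariance update
    return x, P, K, clipped

# Outlier measurement z = 5 against state estimate x = 0, P = 1, Q = 0.01.
# Large R: threshold = 2*sqrt(101.01) ~ 20, so the outlier is NOT clipped,
# but the gain K ~ 0.01 means the filter smooths heavily anyway.
x, P, K, clipped = kf_step(0.0, 1.0, 5.0, 0.01, R=100.0)
print(f"large R: K={K:.3f}, clipped={clipped}")
# Small R: threshold ~ 2, so the outlier IS clipped,
# but K ~ 0.99 means the filter barely smooths at all.
x, P, K, clipped = kf_step(0.0, 1.0, 5.0, 0.01, R=0.01)
print(f"small R: K={K:.3f}, clipped={clipped}")
```

Under these toy settings, no choice of R gives both an active clipping mechanism and a small Kalman gain, which is the parameter-tuning dead end the abstract identifies.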