Innovation Saturation Does Not Robustify Kalman-Filtered Importance Ratios in LLM Reinforcement Learning

FARS·2026-03-02·Run ID: FA0056

Abstract

Kalman Policy Optimization (KPO) applies causal Kalman filtering to smooth importance sampling ratios in LLM reinforcement learning, but its performance is sensitive to the process-to-measurement noise ratio $Q/V$: weak smoothing (large $Q/V$) degrades accuracy by 11.79 percentage points on MATH-500. We investigate whether innovation saturation, a classical technique for robustifying Kalman filters against outliers, can reduce this sensitivity. Our experiments yield a negative result: Innovation-Saturated KPO (IS-KPO) recovers only 6.6% of the performance gap, and the improvement is not statistically significant ($p \approx 0.16$). Diagnostic analysis shows that the saturation mechanism almost never activates (clip fraction $< 10^{-6}$) because KPO's measurement noise $V = 1.0$ yields a clipping threshold far larger than the actual innovations. Attempts to lower the threshold increase the Kalman gain, undermining smoothing. This fundamental design tension (activating clipping requires low $V$, but low $V$ destroys smoothing) cannot be resolved through parameter tuning, ruling out innovation saturation as a robustification strategy for Kalman-based policy optimization.
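The design tension described above can be illustrated with a minimal sketch. The scalar filter below is an assumption about the general form of innovation-saturated Kalman filtering (random-walk state model, Huber-style clip at a multiple of the predicted innovation standard deviation $\sqrt{S}$, $S = P + V$); all numeric values except $V = 1.0$ are illustrative, not the paper's settings.

```python
def kalman_step(x, P, z, Q, V, clip_mult=3.0):
    """One predict/update step with innovation saturation.

    x, P : prior state estimate and its variance
    z    : new measurement (e.g. a raw importance ratio)
    Q, V : process and measurement noise variances
    """
    # Predict under a random-walk state model.
    P_pred = P + Q
    # Innovation and its predicted variance.
    innov = z - x
    S = P_pred + V
    # Saturate the innovation at clip_mult * sqrt(S).
    thresh = clip_mult * S ** 0.5
    clipped = max(-thresh, min(thresh, innov))
    was_clipped = clipped != innov
    # Kalman gain and update.
    K = P_pred / S
    x_new = x + K * clipped
    P_new = (1.0 - K) * P_pred
    return x_new, P_new, K, thresh, was_clipped

x, P = 1.0, 0.1

# With V = 1.0, the threshold is at least clip_mult * 1.0, far above a
# small innovation like 0.05, so saturation never fires and the gain
# K = P_pred / (P_pred + V) stays small (strong smoothing).
_, _, K_hi, th_hi, fired_hi = kalman_step(x, P, z=1.05, Q=0.01, V=1.0)

# Lowering V shrinks the threshold, but it also drives K toward 1,
# so the filter tracks the noisy measurement instead of smoothing it.
_, _, K_lo, th_lo, fired_lo = kalman_step(x, P, z=1.05, Q=0.01, V=0.001)

print(fired_hi, round(K_hi, 3))  # → False 0.099
print(fired_lo, round(K_lo, 3))  # → False 0.991
```

The two calls make the trade-off concrete: shrinking $V$ by three orders of magnitude moves the gain from about 0.1 to about 0.99, which is precisely why tuning the threshold cannot rescue the mechanism.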

Resources