Innovation Saturation Does Not Robustify Kalman-Filtered Importance Ratios in LLM Reinforcement Learning

FARS·2026-03-02·Run ID: FA0056

Abstract

Kalman Policy Optimization (KPO) applies causal Kalman filtering to smooth importance sampling ratios in LLM reinforcement learning, but its performance is sensitive to the process-to-measurement noise ratio $Q/V$: weak smoothing (large $Q/V$) degrades accuracy by 11.79 percentage points on MATH-500. We investigate whether innovation saturation, a classical technique for robustifying Kalman filters against outliers, can reduce this sensitivity. Our experiments yield a negative result: Innovation-Saturated KPO (IS-KPO) recovers only 6.6% of the performance gap, and the improvement is not statistically significant ($p \approx 0.16$). Diagnostic analysis shows that the saturation mechanism almost never activates (clip fraction $< 10^{-6}$) because KPO's measurement noise $V = 1.0$ yields a clipping threshold far larger than the actual innovations. Attempts to lower the threshold increase the Kalman gain, undermining smoothing. This fundamental design tension (activating clipping requires low $V$, but low $V$ destroys smoothing) cannot be resolved through parameter tuning, ruling out innovation saturation as a robustification strategy for Kalman-based policy optimization.
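The design tension described above can be illustrated with a minimal sketch. The scalar filter below is an assumption about the general form of innovation-saturated Kalman filtering (random-walk state model, Huber-style clip at a multiple of the predicted innovation standard deviation $\sqrt{S}$, $S = P + V$); all numeric values except $V = 1.0$ are illustrative, not the paper's settings.

```python
def kalman_step(x, P, z, Q, V, clip_mult=3.0):
    """One predict/update step with innovation saturation.

    x, P : prior state estimate and its variance
    z    : new measurement (e.g. a raw importance ratio)
    Q, V : process and measurement noise variances
    """
    # Predict under a random-walk state model.
    P_pred = P + Q
    # Innovation and its predicted variance.
    innov = z - x
    S = P_pred + V
    # Saturate the innovation at clip_mult * sqrt(S).
    thresh = clip_mult * S ** 0.5
    clipped = max(-thresh, min(thresh, innov))
    was_clipped = clipped != innov
    # Kalman gain and update.
    K = P_pred / S
    x_new = x + K * clipped
    P_new = (1.0 - K) * P_pred
    return x_new, P_new, K, thresh, was_clipped

x, P = 1.0, 0.1

# With V = 1.0, the threshold is at least clip_mult * 1.0, far above a
# small innovation like 0.05, so saturation never fires and the gain
# K = P_pred / (P_pred + V) stays small (strong smoothing).
_, _, K_hi, th_hi, fired_hi = kalman_step(x, P, z=1.05, Q=0.01, V=1.0)

# Lowering V shrinks the threshold, but it also drives K toward 1,
# so the filter tracks the noisy measurement instead of smoothing it.
_, _, K_lo, th_lo, fired_lo = kalman_step(x, P, z=1.05, Q=0.01, V=0.001)

print(fired_hi, round(K_hi, 3))  # → False 0.099
print(fired_lo, round(K_lo, 3))  # → False 0.991
```

The two calls make the trade-off concrete: shrinking $V$ by three orders of magnitude moves the gain from about 0.1 to about 0.99, which is precisely why tuning the threshold cannot rescue the mechanism.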

Resources