EMA-KPO: Simplifying Kalman Policy Optimization with Fixed-Gain Exponential Smoothing

FARS·2026-03-02·Run ID: FA0036

Abstract

Kalman Policy Optimization (KPO) stabilizes reinforcement learning with verifiable rewards (RLVR) by applying Kalman filtering to smooth token-level importance sampling ratios. However, the Kalman filter adds complexity through covariance state tracking and adaptive gain computation. We analyze KPO's Kalman filter and show that with fixed noise parameters ($Q = 10^{-6}$, $V = 1$), the Kalman gain $K_t$ is deterministic, depending only on token position, not on observations. A scheduled exponential moving average (EMA) with $\alpha_t = K_t$ is therefore mathematically equivalent (MSE $< 10^{-14}$). We propose EMA-KPO, which replaces the Kalman filter with this scheduled EMA, eliminating state tracking while preserving the filtering behavior. On mathematical reasoning benchmarks, EMA-KPO matches KPO-clipped (identical 12.29% on AIME'24, +1.45pp on MATH-500) and preserves training stability, avoiding the entropy collapse that affects GRPO. Our analysis shows that KPO's benefits come from the strength of its low-pass filtering, not from Kalman-specific adaptive machinery.
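The claimed equivalence can be checked numerically. The sketch below (not the paper's implementation) assumes a scalar random-walk Kalman filter with the stated noise parameters; the initial covariance `P0 = 1.0`, the initial state `0.0`, and the synthetic observations are illustrative assumptions.

```python
# Sketch: with fixed Q and V, the scalar Kalman gain K_t depends only on
# the step index t, so an EMA scheduled with alpha_t = K_t reproduces the
# Kalman estimate. P0, the init state, and the data are assumptions.
import random

Q, V = 1e-6, 1.0  # fixed process / observation noise from the paper

def kalman_gains(T, P0=1.0):
    """Gain schedule: computed without ever seeing an observation."""
    gains, P = [], P0
    for _ in range(T):
        P_pred = P + Q                 # predict: inflate covariance
        K = P_pred / (P_pred + V)      # Kalman gain
        gains.append(K)
        P = (1.0 - K) * P_pred         # covariance update
    return gains

random.seed(0)
obs = [random.gauss(1.0, 0.3) for _ in range(200)]

# Full scalar Kalman filter over the observations.
x, P, kalman = 0.0, 1.0, []
for y in obs:
    P_pred = P + Q
    K = P_pred / (P_pred + V)
    x = x + K * (y - x)                # state update uses the gain
    P = (1.0 - K) * P_pred
    kalman.append(x)

# Scheduled EMA: same gains, precomputed, no covariance state at runtime.
alphas = kalman_gains(len(obs))
x_ema, ema = 0.0, []
for y, a in zip(obs, alphas):
    x_ema = x_ema + a * (y - x_ema)    # EMA with position-dependent alpha
    ema.append(x_ema)

mse = sum((k - e) ** 2 for k, e in zip(kalman, ema)) / len(obs)
assert mse < 1e-14  # trajectories coincide to floating-point precision
```

The equivalence is exact because the Kalman state update $x_t = x_{t-1} + K_t (y_t - x_{t-1})$ is itself an EMA step once the gain sequence is fixed; only the covariance recursion that produces $K_t$ differs, and with constant $Q$ and $V$ that recursion never consults the data.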

Resources