EMA-KPO: Simplifying Kalman Policy Optimization with Fixed-Gain Exponential Smoothing

FARS·2026-03-02·Run ID: FA0036

Abstract

Kalman Policy Optimization (KPO) stabilizes reinforcement learning with verifiable rewards (RLVR) by applying Kalman filtering to smooth token-level importance sampling ratios. However, the Kalman filter adds complexity through covariance state tracking and adaptive gain computation. We analyze KPO's Kalman filter and show that with fixed noise parameters ($Q = 10^{-6}$, $V = 1$), the Kalman gain $K_t$ is deterministic, depending only on token position, not on observations. A scheduled exponential moving average (EMA) with $\alpha_t = K_t$ is therefore mathematically equivalent (MSE $< 10^{-14}$). We propose EMA-KPO, which replaces the Kalman filter with this scheduled EMA, eliminating state tracking while preserving the filtering behavior. On mathematical reasoning benchmarks, EMA-KPO matches KPO-clipped (identical 12.29% on AIME'24, +1.45pp on MATH-500) and preserves training stability, avoiding the entropy collapse that affects GRPO. Our analysis shows that KPO's benefits come from the strength of its low-pass filtering, not from Kalman-specific adaptive machinery.
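The claimed equivalence can be checked numerically. The sketch below (not the paper's implementation) assumes a scalar random-walk Kalman filter with the stated noise parameters; the initial covariance `P0 = 1.0`, the initial state `0.0`, and the synthetic observations are illustrative assumptions.

```python
# Sketch: with fixed Q and V, the scalar Kalman gain K_t depends only on
# the step index t, so an EMA scheduled with alpha_t = K_t reproduces the
# Kalman estimate. P0, the init state, and the data are assumptions.
import random

Q, V = 1e-6, 1.0  # fixed process / observation noise from the paper

def kalman_gains(T, P0=1.0):
    """Gain schedule: computed without ever seeing an observation."""
    gains, P = [], P0
    for _ in range(T):
        P_pred = P + Q                 # predict: inflate covariance
        K = P_pred / (P_pred + V)      # Kalman gain
        gains.append(K)
        P = (1.0 - K) * P_pred         # covariance update
    return gains

random.seed(0)
obs = [random.gauss(1.0, 0.3) for _ in range(200)]

# Full scalar Kalman filter over the observations.
x, P, kalman = 0.0, 1.0, []
for y in obs:
    P_pred = P + Q
    K = P_pred / (P_pred + V)
    x = x + K * (y - x)                # state update uses the gain
    P = (1.0 - K) * P_pred
    kalman.append(x)

# Scheduled EMA: same gains, precomputed, no covariance state at runtime.
alphas = kalman_gains(len(obs))
x_ema, ema = 0.0, []
for y, a in zip(obs, alphas):
    x_ema = x_ema + a * (y - x_ema)    # EMA with position-dependent alpha
    ema.append(x_ema)

mse = sum((k - e) ** 2 for k, e in zip(kalman, ema)) / len(obs)
assert mse < 1e-14  # trajectories coincide to floating-point precision
```

The equivalence is exact because the Kalman state update $x_t = x_{t-1} + K_t (y_t - x_{t-1})$ is itself an EMA step once the gain sequence is fixed; only the covariance recursion that produces $K_t$ differs, and with constant $Q$ and $V$ that recursion never consults the data.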

Resources