Confidence-Bounded Unit-Test Rewards for Reinforcement Learning from Verifiable Rewards
Abstract
Reinforcement learning from verifiable rewards (RLVR) has emerged as a powerful paradigm for training code generation models using unit tests as automatic verifiers. However, when only a small number of tests is executed per rollout for computational efficiency, the standard pass-rate reward becomes a noisy estimate of true code quality. We propose the Lower Confidence Bound (LCB) reward, which models test outcomes as Bernoulli trials and computes the α-quantile of the Beta posterior distribution over the true pass probability. This provides a principled conservative estimate that accounts for finite-sample uncertainty. Experiments on MBPP+ and HumanEval+ demonstrate that LCB (k=5) achieves the best Pass@1 accuracy (57.0% and 57.1%, respectively), outperforming all baselines including Pass-rate (k=10) while using only half the verifier compute. The method is robust to hyperparameter choices and exhibits stable training dynamics.
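The reward described above can be sketched in a few lines of self-contained Python. This is a minimal illustration, not the paper's implementation: the function name `lcb_reward`, the uniform Beta(1, 1) prior, and the default `alpha=0.05` are assumptions, and the Beta quantile is inverted numerically so the snippet needs only the standard library.

```python
# Hypothetical sketch of the LCB reward: treat each unit-test outcome as a
# Bernoulli trial with unknown pass probability p, place a Beta(1, 1) prior
# on p, and return the alpha-quantile of the Beta(s + 1, k - s + 1) posterior
# after observing s passes out of k executed tests.
import math

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function I_x(a, b) via trapezoidal
    integration of the Beta(a, b) density (adequate for a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = x / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        if 0.0 < t < 1.0:
            f = math.exp((a - 1) * math.log(t)
                         + (b - 1) * math.log(1 - t) - log_norm)
        else:
            f = 0.0  # endpoints contribute negligibly for a, b >= 1
        total += f if 0 < i < steps else f / 2  # trapezoidal weights
    return min(total * h, 1.0)

def lcb_reward(passes, k, alpha=0.05):
    """alpha-quantile of the Beta posterior over the true pass rate,
    i.e. a conservative lower confidence bound used as the RL reward."""
    a, b = passes + 1, k - passes + 1  # Beta(1,1) prior + binomial likelihood
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: find x with posterior CDF(x) = alpha
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, a rollout that passes 3 of 5 sampled tests receives a reward below its empirical pass rate of 0.6, with the gap shrinking as more tests are executed; this is the finite-sample conservatism the abstract refers to.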