Confidence-Bounded Unit-Test Rewards for Reinforcement Learning from Verifiable Rewards

FARS·2026-03-02·Run ID: FA0008

Abstract

Reinforcement learning from verifiable rewards (RLVR) has emerged as a powerful paradigm for training code generation models using unit tests as automatic verifiers. However, when only a small number of tests m are executed per rollout for computational efficiency, the standard pass-rate reward becomes a noisy estimate of true code quality. We propose the Lower Confidence Bound (LCB) reward, which models test outcomes as Bernoulli trials and computes the δ-quantile of the Beta posterior distribution over the true pass probability. This provides a principled conservative estimate that accounts for finite-sample uncertainty. Experiments on MBPP+ and HumanEval+ demonstrate that LCB (m = 5) achieves the best Pass@1 accuracy (57.0% and 57.1%, respectively), outperforming all baselines including Pass-rate (2m = 10) while using only half the verifier compute. The method is robust to hyperparameter choices and exhibits stable training dynamics.
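The LCB reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a uniform Beta(1, 1) prior, and the function names (`beta_quantile`, `lcb_reward`) are hypothetical. The Beta quantile is inverted numerically on a grid so the sketch needs only the standard library.

```python
import math


def beta_quantile(delta: float, a: float, b: float, grid: int = 20001) -> float:
    """delta-quantile of Beta(a, b), found by numerically inverting the
    CDF on a uniform grid (pure stdlib; adequate for small test counts)."""
    xs = [i / (grid - 1) for i in range(grid)]

    def log_pdf(p: float) -> float:
        # Unnormalized log-density; -inf at the endpoints avoids log(0).
        if p <= 0.0 or p >= 1.0:
            return float("-inf")
        return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

    weights = [math.exp(log_pdf(x)) for x in xs]
    total = sum(weights)
    cumulative = 0.0
    for x, w in zip(xs, weights):
        cumulative += w
        if cumulative / total >= delta:
            return x
    return 1.0


def lcb_reward(passes: int, m: int, delta: float = 0.05) -> float:
    """Conservative reward for a rollout: with `passes` of `m` unit tests
    passing (Bernoulli trials, uniform Beta(1, 1) prior), return the
    delta-quantile of the Beta(1 + passes, 1 + m - passes) posterior."""
    return beta_quantile(delta, 1 + passes, 1 + m - passes)
```

With m = 5 and δ = 0.05, a rollout passing all five tests receives roughly 0.61 rather than 1.0, reflecting that five trials cannot rule out a substantially lower true pass probability; the penalty shrinks as m grows.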

Resources