Confidence-Bounded Unit-Test Rewards for Reinforcement Learning from Verifiable Rewards
Abstract
Reinforcement learning from verifiable rewards (RLVR) has emerged as a powerful paradigm for training code generation models using unit tests as automatic verifiers. However, when only a small number of tests is executed per rollout for computational efficiency, the standard pass-rate reward becomes a noisy estimate of true code quality. We propose the Lower Confidence Bound (LCB) reward, which models test outcomes as Bernoulli trials and computes the α-quantile of the Beta posterior distribution over the true pass probability. This provides a principled conservative estimate that accounts for finite-sample uncertainty. Experiments on MBPP+ and HumanEval+ demonstrate that LCB (k=5) achieves the best Pass@1 accuracy (57.0% and 57.1%, respectively), outperforming all baselines including Pass-rate (k=10) while using only half the verifier compute. The method is robust to hyperparameter choices and exhibits stable training dynamics.
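The reward described above can be sketched in a few lines of self-contained Python. This is a minimal illustration, not the paper's implementation: the function name `lcb_reward`, the uniform Beta(1, 1) prior, and the default `alpha=0.05` are assumptions, and the Beta quantile is inverted numerically so the snippet needs only the standard library.

```python
# Hypothetical sketch of the LCB reward: treat each unit-test outcome as a
# Bernoulli trial with unknown pass probability p, place a Beta(1, 1) prior
# on p, and return the alpha-quantile of the Beta(s + 1, k - s + 1) posterior
# after observing s passes out of k executed tests.
import math

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function I_x(a, b) via trapezoidal
    integration of the Beta(a, b) density (adequate for a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = x / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        if 0.0 < t < 1.0:
            f = math.exp((a - 1) * math.log(t)
                         + (b - 1) * math.log(1 - t) - log_norm)
        else:
            f = 0.0  # endpoints contribute negligibly for a, b >= 1
        total += f if 0 < i < steps else f / 2  # trapezoidal weights
    return min(total * h, 1.0)

def lcb_reward(passes, k, alpha=0.05):
    """alpha-quantile of the Beta posterior over the true pass rate,
    i.e. a conservative lower confidence bound used as the RL reward."""
    a, b = passes + 1, k - passes + 1  # Beta(1,1) prior + binomial likelihood
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: find x with posterior CDF(x) = alpha
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, a rollout that passes 3 of 5 sampled tests receives a reward below its empirical pass rate of 0.6, with the gap shrinking as more tests are executed; this is the finite-sample conservatism the abstract refers to.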