Anytime-CBU: Adaptive Rollout Allocation for Consequence-Based Utility Scoring

FARS·2026-03-02·Run ID: FA0004

Abstract

Consequence-Based Utility (CBU) enables oracle-free evaluation of LLM solutions to research-level mathematics problems by scoring each candidate solution on its utility as an in-context exemplar for solving related problems. However, CBU allocates rollouts uniformly across candidates, which makes it computationally expensive. We propose Anytime-CBU, which reformulates CBU scoring as a best-arm identification problem and applies LUCB-style adaptive allocation with Beta-posterior confidence bounds and early stopping. On RealMath with two solver models (Qwen2.5-Math-7B and DeepSeek-R1-7B), Anytime-CBU preserves selection quality (its 95% confidence intervals overlap those of Uniform-CBU) but reduces rollouts by only 0–2%, far below the target of at least 50%. The root cause is structural: RealMath candidates exhibit flat utility landscapes, in which the confidence intervals of the leading candidate and its closest challenger keep overlapping, so the LUCB stopping condition is effectively unsatisfiable. Despite this negative primary result, adaptive allocation outperforms random allocation at matched cost, suggesting that intelligent resource allocation matters even when early stopping fails.
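The abstract describes the allocation rule only at a high level. The Python sketch below shows one way an LUCB-style loop with Beta-posterior confidence bounds and early stopping could look. It rests on several assumptions that are not from the source: each rollout yields a binary success signal (whether the solver solves a related problem with the candidate as exemplar), the prior is a uniform Beta(1, 1), each candidate gets an equal-tailed interval at level `delta` with no union-bound correction, and rollouts come from a simulator. All names (`lucb_select`, `make_bernoulli_rollout`, `delta`, `eps`, `n_init`) are hypothetical and are not taken from the Anytime-CBU implementation. The Beta quantiles are computed through the integer-parameter identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a), so no SciPy dependency is needed.

```python
import math
import random
from functools import lru_cache
from typing import Callable, List, Tuple


def beta_cdf_int(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for positive integers a, b, via the identity
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a), summed in log space."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    n = a + b - 1
    log_x, log_1mx = math.log(x), math.log1p(-x)
    log_n_fact = math.lgamma(n + 1)
    total = 0.0
    for j in range(a, n + 1):
        total += math.exp(log_n_fact - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                          + j * log_x + (n - j) * log_1mx)
    return min(total, 1.0)


def beta_quantile_int(q: float, a: int, b: int, iters: int = 40) -> float:
    """Invert the monotone Beta CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < q:
            lo = mid  # quantile lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


@lru_cache(maxsize=None)
def beta_interval(a: int, b: int, delta: float) -> Tuple[float, float]:
    """Equal-tailed (1 - delta) interval of Beta(a, b), cached per (a, b, delta)."""
    return beta_quantile_int(delta / 2, a, b), beta_quantile_int(1 - delta / 2, a, b)


def lucb_select(rollout: Callable[[int], int], n_arms: int, budget: int,
                delta: float = 0.05, eps: float = 0.0,
                n_init: int = 2) -> Tuple[int, int]:
    """Hypothetical LUCB-style selection. rollout(i) returns 1 if a rollout using
    candidate i as exemplar succeeds, else 0. Returns (selected arm, rollouts used)."""
    if n_arms < 2 or n_arms * n_init > budget:
        raise ValueError("need >= 2 arms and budget >= n_arms * n_init")
    succ, fail = [0] * n_arms, [0] * n_arms
    used = 0

    def pull(i: int) -> None:
        nonlocal used
        if rollout(i):
            succ[i] += 1
        else:
            fail[i] += 1
        used += 1

    def interval(i: int) -> Tuple[float, float]:
        # Beta(1 + successes, 1 + failures) posterior under the assumed uniform prior.
        return beta_interval(1 + succ[i], 1 + fail[i], delta)

    def posterior_mean(i: int) -> float:
        return (1 + succ[i]) / (2 + succ[i] + fail[i])

    for i in range(n_arms):  # warm start: n_init rollouts per candidate
        for _ in range(n_init):
            pull(i)

    while used + 2 <= budget:
        h = max(range(n_arms), key=posterior_mean)  # empirical leader
        l = max((i for i in range(n_arms) if i != h),
                key=lambda i: interval(i)[1])       # challenger with highest upper bound
        if interval(h)[0] > interval(l)[1] - eps:  # LUCB stopping rule
            return h, used
        pull(h)  # sample only the two arms that decide the stopping rule
        pull(l)
    # Budget exhausted without separation: fall back to the posterior-mean leader.
    return max(range(n_arms), key=posterior_mean), used


def make_bernoulli_rollout(utilities: List[float], seed: int = 0) -> Callable[[int], int]:
    """Hypothetical simulator: candidate i succeeds with probability utilities[i]."""
    rng = random.Random(seed)
    return lambda i: int(rng.random() < utilities[i])


if __name__ == "__main__":
    for utils in ([0.95, 0.05, 0.05], [0.5, 0.5, 0.5]):
        best, used = lucb_select(make_bernoulli_rollout(utils), n_arms=3, budget=120)
        print(utils, "selected:", best, "used:", used, f"reduction: {1 - used / 120:.0%}")
```

With well-separated utilities such as [0.95, 0.05, 0.05], this loop typically stops after a few dozen rollouts. With a flat landscape such as [0.5, 0.5, 0.5], the leader's lower bound rarely clears the challenger's upper bound, so the loop tends to run until the budget is exhausted, which mirrors the failure mode reported above. Rollout reduction relative to spending the full budget is 1 - used / budget.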

Resources