Chunked Budget Allocation Prevents Non-Monotonic Regressions in World-Model Verification

FARS·2026-03-02·Run ID: FA0058

Abstract

World models that predict environment dynamics can serve as pre-execution verifiers for agents facing irreversible actions. However, sequential verify-and-retry, in which each rejected action triggers agent re-planning, can paradoxically reduce task success as the verification budget increases. We identify trajectory drift as the root cause: when verification rejects a checkout (often a false negative), the agent re-browses and frequently selects a worse product than the one it originally found. We propose chunked budget allocation: instead of spreading the verification budget across many sequential cycles, spend it in fewer cycles with more parallel rollouts per cycle. On WebShop, chunked verification (M = 1 cycle) achieves 21.50% task success versus 0.86% for sequential verify-and-retry (M = 10 cycles), a 25× improvement at the same budget. Surprisingly, consensus aggregation across parallel rollouts provides no benefit over single-rollout acceptance, because the world model's poor calibration (it predicts failure in 98.4% of cycles) causes consensus to degenerate. Our results demonstrate that budget allocation structure matters more than aggregation sophistication when world models are poorly calibrated.
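The difference between the two allocation schemes can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`split_budget`, `verify_and_retry`), the callables `propose` and `rollout_predicts_success`, the "accept if any rollout predicts success" rule, and the fallback of committing the last proposal when every cycle rejects are all assumptions made for illustration. A fixed rollout budget is split across M cycles, and each rejection forces a fresh proposal from the agent.

```python
from typing import Callable, List, Tuple


def split_budget(total_rollouts: int, num_cycles: int) -> List[int]:
    """Split a rollout budget as evenly as possible across verification cycles."""
    if num_cycles < 1 or total_rollouts < num_cycles:
        raise ValueError("need at least one rollout per cycle")
    base, extra = divmod(total_rollouts, num_cycles)
    # The first `extra` cycles receive one additional rollout.
    return [base + 1 if i < extra else base for i in range(num_cycles)]


def verify_and_retry(
    propose: Callable[[int], str],
    rollout_predicts_success: Callable[[str, int], bool],
    total_rollouts: int,
    num_cycles: int,
) -> Tuple[str, int]:
    """Return (committed action, number of cycles used).

    Assumed acceptance rule: a cycle accepts its proposal if any of its
    rollouts predicts success. Assumed fallback: if every cycle rejects,
    the last proposal is committed.
    """
    action = ""
    for cycle, k in enumerate(split_budget(total_rollouts, num_cycles)):
        # Each new cycle corresponds to the agent re-planning after a rejection.
        action = propose(cycle)
        if any(rollout_predicts_success(action, r) for r in range(k)):
            return action, cycle + 1
    return action, num_cycles


if __name__ == "__main__":
    # Toy agent: the proposal index stands in for how far the agent has drifted.
    propose = lambda cycle: f"product-{cycle}"
    # Toy verifier that always predicts failure (an extreme of poor calibration).
    always_reject = lambda action, rollout: False

    print(verify_and_retry(propose, always_reject, total_rollouts=10, num_cycles=10))
    print(verify_and_retry(propose, always_reject, total_rollouts=10, num_cycles=1))
```

With a verifier that rejects everything, the sequential setting (M = 10) ends up committing the tenth re-planned proposal, while the chunked setting (M = 1) spends all ten rollouts on the first proposal and keeps it. The toy does not model product quality; it only shows how the number of cycles bounds how far the committed action can move from the agent's original choice.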

Resources