Timeout Bootstrapping for Long-CoT RLVR: Promise and Pitfalls
Abstract
Reinforcement learning with verifiable rewards (RLVR) enables language models to develop long chain-of-thought reasoning, but training rollouts must be truncated at a maximum token length due to computational constraints. The standard approach treats truncation as failure, conflating unfinished reasoning with incorrect reasoning. We propose \emph{timeout bootstrapping}, drawing on classical RL theory: truncation is a timeout, not a terminal state, so the agent should bootstrap from the critic's value estimate rather than receive a failure reward. We conduct a pre-registered comparison of three truncation strategies on mathematical reasoning with DeepSeek-R1-Distill-Qwen-1.5B. Timeout bootstrapping fails our pre-registered criteria on average (52.50% vs. 53.80% baseline accuracy on long problems) but shows promise on stable runs (+2.20pp over baseline). We identify the primary failure mode: the critic's value estimates collapse within 15--20 training steps, rendering bootstrapping equivalent to truncate-as-failure. Mechanistic analysis reveals that the critic's value estimates are negatively correlated with correctness (negative Pearson correlation; below-chance AUROC), indicating fundamental miscalibration that must be addressed before value bootstrapping can succeed in long-horizon reasoning.
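The distinction between the two truncation strategies can be sketched as a return computation. This is a minimal illustration under an assumed discounted-return formulation; the function name, reward layout, and discount setup are illustrative, not the paper's implementation:

```python
def discounted_returns(rewards, truncated, v_bootstrap, gamma=1.0):
    """Per-step returns for one rollout.

    Truncate-as-failure: a cut-off rollout gets no tail value (treated
    exactly like a failed completion, i.e. v_bootstrap = 0).
    Timeout bootstrapping: a cut-off rollout is continued virtually by
    the critic's value estimate V(s_T) at the truncation point.
    """
    g = gamma * v_bootstrap if truncated else 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Completed rollout: terminal verifiable reward of 1.0.
print(discounted_returns([0.0, 0.0, 1.0], truncated=False, v_bootstrap=0.0))
# → [1.0, 1.0, 1.0]

# Truncated rollout, truncate-as-failure: the whole trajectory is zeroed out.
print(discounted_returns([0.0, 0.0], truncated=True, v_bootstrap=0.0))
# → [0.0, 0.0]

# Truncated rollout, timeout bootstrapping with a (hypothetical) V(s_T) = 0.6.
print(discounted_returns([0.0, 0.0], truncated=True, v_bootstrap=0.6))
# → [0.6, 0.6]
```

The sketch also makes the reported failure mode concrete: if the critic's estimates collapse to zero, the bootstrapped tail `gamma * v_bootstrap` vanishes and the two strategies become identical.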