Timeout Bootstrapping for Long-CoT RLVR: Promise and Pitfalls

FARS·2026-03-02·Run ID: FA0069

Abstract

Reinforcement learning with verifiable rewards (RLVR) enables language models to develop long chain-of-thought reasoning, but training rollouts must be truncated at a maximum token length due to computational constraints. The standard approach treats truncation as failure, conflating unfinished reasoning with incorrect reasoning. We propose \emph{timeout bootstrapping}, drawing from classical RL theory: truncation is a timeout, not a terminal state, so the agent should bootstrap from the critic's value estimate rather than assigning a failure reward. We conduct a pre-registered comparison of three truncation strategies on mathematical reasoning with DeepSeek-R1-Distill-Qwen-1.5B. Timeout bootstrapping fails our pre-registered criteria on average (52.50\% vs 53.80\% baseline on long problems) but shows promise on stable runs (+2.20pp over baseline). We identify the primary failure mode: the critic's value estimates collapse to 1.0 within 15--20 training steps, rendering bootstrapping equivalent to truncate-as-failure. Mechanistic analysis reveals the critic exhibits negative correlation with correctness (Pearson $r = -0.211$, AUROC $= 0.314$), indicating fundamental miscalibration that must be addressed before value bootstrapping can succeed in long-horizon reasoning.
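The core distinction between truncate-as-failure and timeout bootstrapping can be sketched as follows. This is a minimal illustration with hypothetical names (`rollout_return`, its parameters) rather than the paper's actual training code; it assumes binary verifier rewards and a scalar critic value at the truncation point.

```python
def rollout_return(verifier_reward, truncated, value_estimate, gamma=1.0):
    """Terminal reward target for one rollout (illustrative sketch).

    verifier_reward: 1.0 or 0.0 from the verifiable-reward check
        (only meaningful when the rollout actually finished).
    truncated: True if the rollout hit the max token budget.
    value_estimate: critic's value V(s_T) at the truncation point.
    """
    if not truncated:
        # Finished rollout: use the verified correctness reward.
        return verifier_reward
    # Truncate-as-failure would return 0.0 here, conflating "unfinished"
    # with "incorrect". Timeout bootstrapping instead treats truncation
    # as a timeout and bootstraps from the critic's estimate.
    return gamma * value_estimate

# Finished, correct rollout:
print(rollout_return(1.0, truncated=False, value_estimate=0.7))  # 1.0
# Truncated rollout, bootstrapped from the critic:
print(rollout_return(0.0, truncated=True, value_estimate=0.7))   # 0.7
```

Note that if the critic's values collapse to a constant (the 1.0 collapse reported above), the bootstrapped term carries no information about correctness, and after advantage normalization the scheme degenerates to the behavior of a constant truncation reward.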

Resources