The Repetition Advantage in Long-CoT SFT is a Termination Effect
Abstract
Recent work shows that in long chain-of-thought (CoT) supervised fine-tuning (SFT), training for many epochs on a small dataset substantially outperforms single-epoch training on a larger dataset---a counterintuitive ``repetition advantage.'' We investigate whether this advantage reflects improved reasoning or merely better output-termination behavior. Using a diagnostic framework that decomposes accuracy into ParseRate (the fraction of parseable outputs) and AccParse (accuracy conditional on successful parsing), we show that the repetition advantage is primarily a termination effect. On AIME benchmarks, the accuracy gap between the repetition and data-scaling conditions \emph{reverses} once we condition on successful parsing, with mediation fractions exceeding 1.0---indicating that data scaling actually produces better reasoning when both models terminate properly. We propose Termination-Aware SFT, which upweights the loss on termination tokens; it improves accuracy by 2.0 percentage points over standard SFT but recovers only 14% of the repetition advantage. Our findings suggest that apparent reasoning gains from data repetition may largely reflect format learning rather than enhanced reasoning capability.
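The decomposition the abstract describes can be made concrete with a short sketch. This is our illustration, not code from the paper: we assume unparseable outputs are scored as incorrect, so overall accuracy factors exactly as ParseRate $\times$ AccParse; the function and variable names are hypothetical.

```python
def decompose(outputs):
    """Given (parsed_ok, correct) flags per problem, return
    (accuracy, parse_rate, acc_parse).

    Assumption (ours): an unparseable output counts as incorrect,
    so accuracy == parse_rate * acc_parse by construction.
    """
    n = len(outputs)
    parsed = [o for o in outputs if o[0]]
    correct = [o for o in outputs if o[0] and o[1]]
    parse_rate = len(parsed) / n
    acc_parse = len(correct) / len(parsed) if parsed else 0.0
    accuracy = len(correct) / n
    return accuracy, parse_rate, acc_parse

# Illustrative data: 10 problems; 8 outputs parse, 6 of those are correct.
outs = [(True, True)] * 6 + [(True, False)] * 2 + [(False, False)] * 2
acc, pr, ap = decompose(outs)
# acc = 0.6, pr = 0.8, ap = 0.75, and acc == pr * ap
```

Under this scoring, a condition can raise accuracy purely by raising ParseRate (better termination) even if AccParse, the reasoning-quality term, is unchanged or worse---which is the effect the paper isolates.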