Length-Weighted Loss Does Not Explain the Repetition Advantage in Long-CoT Supervised Fine-Tuning

FARS·2026-03-02·Run ID: FA0156

Abstract

Recent work shows that data repetition dramatically outperforms data scaling in long chain-of-thought (Long-CoT) supervised fine-tuning: training on 1.6k samples for 32 epochs achieves 38.3% accuracy versus 25.6% for 51.2k samples over 1 epoch, despite identical compute. We hypothesize that this ``repetition advantage'' may be explained by per-sequence mean cross-entropy loss underweighting long reasoning traces. We test this by proposing a length-weighted loss $L_{\text{len}} = (T/T_{\text{ref}}) \cdot L_{\text{mean}}$ that upweights long sequences in proportion to their length. Our experiments on OLMo3-7B across the AIME and GPQA benchmarks show that this hypothesis is \textbf{refuted}: length-weighted loss achieves 25.5% accuracy, statistically identical to standard data scaling, recovering none of the 12.7-point gap. Systematic exploration of stronger weighting schemes (quadratic, token-level) also fails to close the gap or degrades performance. These results eliminate gradient signal distribution as an explanation for the repetition advantage, pointing toward memorization convergence through repeated exposure as the likely mechanism.
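To make the proposed objective concrete, here is a minimal sketch of the length-weighted loss $L_{\text{len}} = (T/T_{\text{ref}}) \cdot L_{\text{mean}}$. The function name, the choice of $T_{\text{ref}} = 4096$, and the per-token loss values below are illustrative assumptions, not the paper's implementation.

```python
def length_weighted_loss(per_token_losses, t_ref=4096):
    """Sketch of L_len = (T / T_ref) * L_mean for one sequence.

    per_token_losses: per-token cross-entropy values for the sequence.
    t_ref: assumed reference length (hypothetical; the paper's value
           may differ).
    """
    t = len(per_token_losses)
    l_mean = sum(per_token_losses) / t  # standard per-sequence mean CE
    return (t / t_ref) * l_mean         # upweight long traces linearly

# Under mean cross-entropy, a long trace and a short trace with equal
# per-token loss contribute equally to the gradient; under L_len the
# longer trace is upweighted in proportion to its length.
short = length_weighted_loss([2.0] * 512)   # T = 512  -> 0.25
long_ = length_weighted_loss([2.0] * 4096)  # T = 4096 -> 2.0
# long_ / short == 8.0, matching the 8x length ratio
```

With equal mean loss, the weighted objective scales each sequence's contribution by $T/T_{\text{ref}}$, which is exactly the reweighting the hypothesis tests.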

Resources