Length-Weighted Loss Does Not Explain the Repetition Advantage in Long-CoT Supervised Fine-Tuning
Abstract
Recent work shows that data repetition dramatically outperforms data scaling in long chain-of-thought (Long-CoT) supervised fine-tuning: training on 1.6k samples for 32 epochs achieves 38.3% accuracy versus 25.6% for 51.2k samples over 1 epoch, despite identical compute. We hypothesize that this ``repetition advantage'' arises because per-sequence mean cross-entropy loss underweights long reasoning traces. To test this, we propose a length-weighted loss that upweights long sequences in proportion to their length. Our experiments on OLMo3-7B across the AIME and GPQA benchmarks \textbf{refute} this hypothesis: the length-weighted loss achieves 25.5% accuracy, statistically indistinguishable from standard data scaling, recovering none of the 12.7-point gap. Stronger weighting schemes (quadratic, token-level) likewise fail to help or degrade performance. These results rule out gradient-signal distribution as an explanation for the repetition advantage, pointing instead toward memorization convergence through repeated exposure as the likely mechanism.
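To make the contrast concrete, the sketch below illustrates the difference between standard per-sequence mean cross-entropy and one plausible form of the length-weighted loss described above. The exact normalization used in the paper is not specified in this abstract, so the weighting here (each sequence weighted by its length relative to the batch-mean length) is an illustrative assumption, not the authors' implementation; `token_losses` is a hypothetical nested list of per-token cross-entropy values.

```python
def per_sequence_mean_loss(token_losses):
    """Standard SFT objective: average token losses within each
    sequence, then average over sequences. A 10k-token trace and a
    100-token trace contribute equally to the batch loss."""
    return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)

def length_weighted_loss(token_losses):
    """Illustrative length-weighted variant (an assumption, not the
    paper's exact formula): weight each sequence's mean loss by its
    length, normalized by the batch-mean length so the overall loss
    scale is preserved. With weights exactly proportional to length,
    this reduces to a plain token-level mean over the batch."""
    lengths = [len(seq) for seq in token_losses]
    mean_len = sum(lengths) / len(lengths)
    return sum(
        (length / mean_len) * (sum(seq) / length)
        for seq, length in zip(token_losses, lengths)
    ) / len(token_losses)

# Toy batch: a short sequence with low loss, a long one with high loss.
batch = [[1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
print(per_sequence_mean_loss(batch))   # short and long traces count equally
print(length_weighted_loss(batch))     # long trace pulls the loss upward
```

Note the design point the hypothesis turns on: under the per-sequence mean, the gradient contribution of each token in a long trace is diluted by its length, whereas the length-weighted form restores roughly equal per-token weight.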