Hard Examples Beat Easy Examples in Repetition-Heavy Long-CoT Fine-Tuning

FARS·2026-03-02·Run ID: FA0045

Abstract

Recent work shows that repetition-heavy training (fine-tuning on small datasets for many epochs) can match data scaling for long chain-of-thought (CoT) supervised fine-tuning. This raises the question: which examples should be repeated? We investigate NLL-based data selection, comparing easy-to-fit (low-NLL) and hard-to-fit (high-NLL) examples under identical repetition-heavy training conditions. Contrary to intuition, high-NLL examples significantly outperform low-NLL examples (33.0% vs. 23.6% aggregate accuracy on AIME and GPQA benchmarks), with the advantage consistent across mathematical and scientific reasoning tasks. Analysis reveals that low-NLL examples are confounded with textual repetition (trigram rate 0.457 vs. 0.206), producing poor termination behavior and unstable training dynamics. Optimization attempts, including hyperparameter tuning and trigram filtering, fail to recover low-NLL performance, indicating the limitation is fundamental to the selection strategy. Our findings provide practical guidance: when using many-epoch repetition on small datasets, select hard-to-fit examples rather than easy-to-fit ones.
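The abstract leans on two concrete quantities: a trigram repetition rate for detecting degenerate text, and per-example NLL for ranking difficulty. The sketch below shows one common way to compute both; the exact metric definitions used in the paper are not specified here, so the repeated-trigram-fraction formula and the function names are illustrative assumptions.

```python
from collections import Counter

def trigram_repetition_rate(tokens):
    """Fraction of trigrams that repeat an earlier trigram.
    (One common definition; assumed here, not taken from the paper.)"""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c - 1 for c in counts.values())  # every occurrence past the first
    return repeated / len(trigrams)

def select_hard_examples(examples, nlls, k):
    """Hypothetical high-NLL selection: keep the k hardest-to-fit examples."""
    order = sorted(range(len(examples)), key=lambda i: nlls[i], reverse=True)
    return [examples[i] for i in order[:k]]

# A looping completion scores high; varied text scores low.
loopy = "the answer is the answer is the answer is correct".split()
varied = "we compute the gradient and update the parameters once".split()
print(trigram_repetition_rate(loopy))   # 0.5
print(trigram_repetition_rate(varied))  # 0.0

# Selection keeps the highest-NLL examples first.
print(select_hard_examples(["a", "b", "c"], [0.2, 1.5, 0.9], k=2))  # ['b', 'c']
```

A filter built from `trigram_repetition_rate` is the kind of trigram filtering the abstract reports trying (and finding insufficient) for rescuing low-NLL selection.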

Resources