Tiny-LR Proxy SFT for Dataset Ranking: An Empirical Investigation

FARS·2026-03-02·Run ID: FA0292

Abstract

Selecting high-quality training data is critical for supervised fine-tuning (SFT) of language models, but evaluating dataset value by full fine-tuning is expensive. Proxy models offer an efficient alternative, yet their rankings may not transfer reliably to larger target models. Recent work on pretraining suggests that training proxies with tiny learning rates improves ranking transfer. We test this hypothesis for SFT dataset ranking, comparing a Standard-LR Proxy (5×10⁻⁵) and a Tiny-LR Proxy (1×10⁻⁵) on 12 math datasets using Qwen2.5-1.5B as proxy and Qwen2.5-7B as target. Our experiments refute the hypothesis: the Tiny-LR Proxy achieves random-level ranking agreement (PDA = 0.500) compared to the Standard-LR Proxy's significantly above-random performance (PDA = 0.712, p = 0.042). However, we discover benchmark-specific behavior: the Tiny-LR Proxy achieves excellent agreement on MATH-500 (PDA = 0.818) but fails on GSM8K, suggesting that reduced learning rates amplify sensitivity to surface-level features rather than preserving transferable quality signals.
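The abstract does not spell out how the PDA agreement score is computed. A minimal sketch, assuming PDA is the fraction of dataset pairs for which the proxy model's ranking matches the target model's ranking (so 0.5 corresponds to chance-level agreement), might look like:

```python
from itertools import combinations

def pairwise_agreement(proxy_scores, target_scores):
    """Fraction of dataset pairs where the proxy preserves the target's ordering.

    proxy_scores[i] and target_scores[i] are the evaluation scores obtained by
    fine-tuning the proxy and target models, respectively, on dataset i.
    Ties are counted as disagreements in this sketch.
    """
    n = len(proxy_scores)
    pairs = list(combinations(range(n), 2))
    agree = sum(
        1
        for i, j in pairs
        if (proxy_scores[i] - proxy_scores[j]) * (target_scores[i] - target_scores[j]) > 0
    )
    return agree / len(pairs)

# Hypothetical illustration: a proxy that preserves the target's ordering
# scores 1.0; one that inverts it scores 0.0.
print(pairwise_agreement([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]))  # → 1.0
print(pairwise_agreement([0.3, 0.2, 0.1], [0.4, 0.5, 0.6]))  # → 0.0
```

With 12 datasets there are 66 pairs, so scores such as 0.712 (≈47/66 pairs agreeing) and 0.818 are consistent with this pairwise definition; the exact tie-handling convention used in the paper is an assumption here.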

Resources