Tool-Gated Residual Distillation for DataChef Verifier Scoring
Abstract
Data curation frameworks like DataChef rely on LLM-as-a-judge verifiers to score training instances, but these verifiers are expensive and their rubric-based quality assessments may not correlate with downstream model performance. We investigate whether LLM rubric scores predict which datasets lead to better fine-tuned models, and propose \textbf{Tool-Gated Residual Distillation} as a lightweight alternative. Our approach factorizes the verification task: deterministic tool gating handles structural failures (empty responses, degenerate repetition), while a small distilled student model (Qwen2.5-1.5B with LoRA) learns a 3-way semantic classification from teacher labels. On two held-out tasks from DataChef (LiveCodeBench, OpenFinData), Tool+Distilled improves average Spearman correlation with downstream benchmark scores by 1.18 points over LLM-only baselines. Critically, our method achieves zero top-1 regret (it always selects the ground-truth best dataset) while requiring zero teacher API calls at inference, eliminating the 2.56M tokens consumed by LLM-based approaches.
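The factorized verifier described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gating thresholds, the n-gram repetition heuristic, and the `classify` callable (standing in for the distilled Qwen2.5-1.5B student) are all assumptions.

```python
# Sketch of tool-gated residual verification: deterministic gates catch
# structural failures, and only surviving responses reach the distilled
# student. All thresholds and helper names here are illustrative.

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of duplicated word n-grams; 0.0 means no repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def verify(response: str, classify) -> int:
    """Score a training response.

    Gate 1 rejects empty responses; gate 2 rejects degenerate repetition.
    Otherwise `classify` (the small student model) returns one of three
    semantic quality labels (0 = bad, 1 = acceptable, 2 = good).
    """
    if not response.strip():               # gate 1: empty response
        return 0
    if repetition_ratio(response) > 0.5:   # gate 2: degenerate repetition
        return 0
    return classify(response)              # residual semantic judgment
```

Because the gates are deterministic and run locally, the student model is only invoked on the residual (structurally valid) responses, which is what removes teacher API calls at inference time.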