Answer-Free Self-Referential Critics: Training Solve-Then-Judge VLM Judges with Preference Labels but Without Ground-Truth Answers
Abstract
Training vision-language model (VLM) critics under the Solve-Then-Judge paradigm requires self-prediction rewards that compare the critic's answer against the ground truth, limiting applicability to datasets with answer annotations. We propose Answer-Free Self-Referential Critics (AF-SRC), which replaces ground-truth supervision with preference-derived pseudo-labels combined with group consistency gating. Our method extracts pseudo-labels from preferred responses and applies the self-prediction reward only when the model produces consistent predictions across option permutations. On a physical reasoning benchmark, AF-SRC achieves 13.27% debiased preference accuracy, surpassing the oracle baseline (10.62%) that has access to ground-truth answers, for a recovery ratio of 150%. This demonstrates that preference-derived pseudo-labels with consistency regularization can provide stronger training signals than ground-truth answers alone, enabling scalable critic training on preference-only datasets.
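To make the gating mechanism concrete, the following is a minimal sketch in Python of how a consistency-gated, answer-free self-prediction reward could be computed. The multiple-choice setting, the majority-vote gate, the 0.75 consistency threshold, and all identifiers (af_src_reward, permutation_answers, pseudo_label) are illustrative assumptions, not the authors' implementation.

from collections import Counter

def af_src_reward(permutation_answers, pseudo_label, threshold=0.75):
    """Answer-free self-prediction reward with group consistency gating.

    permutation_answers: the critic's answers to the same question under
        several option-order permutations, mapped back to canonical labels.
    pseudo_label: answer extracted from the preferred response of a
        preference pair, standing in for the missing ground truth.
    """
    majority_answer, majority_count = Counter(permutation_answers).most_common(1)[0]

    # Group consistency gate: apply the self-prediction reward only when the
    # critic agrees with itself across permutations (threshold is assumed).
    if majority_count / len(permutation_answers) < threshold:
        return 0.0  # gate closed: withhold the self-prediction signal

    # Self-prediction reward against the preference-derived pseudo-label,
    # replacing the comparison to a ground-truth answer.
    return 1.0 if majority_answer == pseudo_label else 0.0

# Example: the critic agrees with itself on 3 of 4 permutations (0.75 >= 0.75)
# and matches the pseudo-label, so the reward fires.
print(af_src_reward(["B", "B", "B", "C"], "B"))  # -> 1.0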