Reference-based verifiers are critical components of reinforcement learning with verifiable rewards (RLVR): they provide reward signals by comparing model responses against ground-truth answers. However, these verifiers are vulnerable to ``master-key'' attacks, in which trivial responses such as single tokens or short phrases achieve false positive rates of 25--29\% without containing any actual answer. We propose RefSwap, a training-free detection method that exploits a fundamental asymmetry: legitimate correct responses exhibit self-solving behavior (a high probability of verification against random references), while master-key false positives cannot self-solve. By sampling $K$ counterfactual references and computing the maximum verification probability over them (max\_p\_cf), Multi-CF RefSwap achieves near-perfect separation (AUC = 0.991) between true positives and master keys. On xVerify-7B-I, RefSwap reduces the average master-key false positive rate from 25.50\% to 0.81\%, a 96.8\% relative reduction, at an accuracy cost of only 2.74 percentage points. Its effectiveness nevertheless depends on verifier architecture: RefSwap works on xVerify but not on Qwen, revealing that backbone design determines susceptibility to counterfactual-based detection.
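The Multi-CF scoring step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the verifier interface `verify_prob(question, response, reference)`, the helper names, the default $K$, and both thresholds are assumptions introduced only for illustration. The sketch accepts a response only if the verifier accepts it against the true reference and its max\_p\_cf over $K$ randomly sampled counterfactual references is high, following the asymmetry stated in the abstract.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical verifier interface (an assumption, not a real library API):
# returns the probability that `response` is judged correct given `reference`.
VerifyFn = Callable[[str, str, str], float]


def max_p_cf(
    question: str,
    response: str,
    reference_pool: Sequence[str],
    verify_prob: VerifyFn,
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> float:
    """Maximum verification probability over K sampled counterfactual references."""
    rng = rng or random.Random(0)
    k = min(k, len(reference_pool))
    if k == 0:
        raise ValueError("reference_pool must contain at least one reference")
    counterfactuals = rng.sample(list(reference_pool), k)
    return max(verify_prob(question, response, ref) for ref in counterfactuals)


def refswap_accept(
    question: str,
    response: str,
    reference: str,
    reference_pool: Sequence[str],
    verify_prob: VerifyFn,
    k: int = 4,
    accept_threshold: float = 0.5,  # assumed cutoff for the base verifier
    cf_threshold: float = 0.5,      # assumed cutoff on max_p_cf
    rng: Optional[random.Random] = None,
) -> bool:
    """Accept only responses that pass the verifier and also score a high max_p_cf."""
    # Step 1: ordinary verification against the true reference.
    if verify_prob(question, response, reference) < accept_threshold:
        return False
    # Step 2: swap in counterfactual references drawn from other items.
    pool = [ref for ref in reference_pool if ref != reference]
    score = max_p_cf(question, response, pool, verify_prob, k=k, rng=rng)
    # Per the abstract, a low max_p_cf on an accepted response suggests a master key.
    return score >= cf_threshold
```

In practice the cutoff on max\_p\_cf would presumably be tuned on held-out data rather than fixed at 0.5, and `reference_pool` would be drawn from ground-truth answers of other questions in the same dataset.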

RefSwap: Counterfactual Reference-Swap Verification for Robust LLM Verifiers

