View-Disagreement Escalation for Robust Web-Agent Trajectory Judges
Abstract
LLM-based judges are widely used to evaluate web-agent trajectories, but they are vulnerable to manipulation through unfaithful chain-of-thought (CoT) reasoning. We propose view-disagreement escalation, a training-free framework that compares judgments from two counterfactual input views, one with CoT and one without, to detect unreliable predictions. When the views disagree, we escalate to strict evidence-anchored evaluation. Our key insight is that CoT manipulation shifts CoT-dependent judgments while leaving CoT-agnostic judgments stable, so disagreement between the views signals potential manipulation. On AgentRewardBench, our method achieves a 63% relative reduction in attack sensitivity (ΔFPR: 4.31% vs. 11.71%) while maintaining competitive F1 (72.13% vs. 72.93%) and achieving the best recall (77.63%). A 2.15× enrichment of view disagreement under attack validates our mechanistic hypothesis.
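To make the decision rule concrete, the following is a minimal Python sketch of the escalation logic described above, under stated assumptions rather than the paper's actual implementation: the `judge` and `strict_judge` callables, and the way the two views are serialized, are hypothetical placeholders for LLM-judge calls.

```python
from typing import Callable

# Hypothetical judge interface: maps a serialized trajectory view
# to a binary success verdict (True = task completed).
Judge = Callable[[str], bool]


def judge_with_escalation(
    trajectory: str,
    cot: str,
    judge: Judge,
    strict_judge: Judge,
) -> bool:
    """View-disagreement escalation (illustrative sketch).

    Judge the trajectory under two counterfactual views: one including
    the agent's chain-of-thought (CoT-dependent view) and one with the
    CoT stripped (CoT-agnostic view). If the verdicts agree, accept the
    shared verdict; if they disagree, escalate to a strict,
    evidence-anchored judge that ignores the CoT entirely.
    """
    # CoT-dependent view: trajectory plus the agent's stated reasoning.
    verdict_with_cot = judge(f"{trajectory}\n\nAgent reasoning:\n{cot}")
    # CoT-agnostic view: trajectory evidence only.
    verdict_without_cot = judge(trajectory)

    if verdict_with_cot == verdict_without_cot:
        # Views agree: treat the prediction as reliable.
        return verdict_with_cot

    # Views disagree: possible CoT manipulation; escalate to strict
    # evidence-anchored evaluation over the CoT-free view.
    return strict_judge(trajectory)
```

The mechanism relies on the asymmetry named in the abstract: a manipulated CoT can only flip the CoT-dependent verdict, so any flip surfaces as a view disagreement and routes the case to the stricter judge instead of being accepted at face value.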