View-Disagreement Escalation for Robust Web-Agent Trajectory Judges
Abstract
LLM-based judges are widely used to evaluate web-agent trajectories, but they are vulnerable to manipulation through unfaithful chain-of-thought (CoT) reasoning. We propose view-disagreement escalation, a training-free framework that compares judgments from two counterfactual input views, one with CoT and one without, to detect unreliable predictions. When the views disagree, we escalate to strict evidence-anchored evaluation. Our key insight is that CoT manipulation shifts CoT-dependent judgments while leaving CoT-agnostic judgments stable, so disagreement between the views signals potential manipulation. On AgentRewardBench, our method achieves a 63% relative reduction in attack sensitivity (ΔFPR: 4.31% vs. 11.71%) while maintaining competitive F1 (72.13% vs. 72.93%) and achieving the best recall (77.63%). A 2.15× enrichment of view disagreement under attack validates our mechanistic hypothesis.
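To make the decision rule concrete, the following is a minimal Python sketch of the escalation logic described above, under stated assumptions rather than the paper's actual implementation: the `judge` and `strict_judge` callables, and the way the two views are serialized, are hypothetical placeholders for LLM-judge calls.

```python
from typing import Callable

# Hypothetical judge interface: maps a serialized trajectory view
# to a binary success verdict (True = task completed).
Judge = Callable[[str], bool]


def judge_with_escalation(
    trajectory: str,
    cot: str,
    judge: Judge,
    strict_judge: Judge,
) -> bool:
    """View-disagreement escalation (illustrative sketch).

    Judge the trajectory under two counterfactual views: one including
    the agent's chain-of-thought (CoT-dependent view) and one with the
    CoT stripped (CoT-agnostic view). If the verdicts agree, accept the
    shared verdict; if they disagree, escalate to a strict,
    evidence-anchored judge that ignores the CoT entirely.
    """
    # CoT-dependent view: trajectory plus the agent's stated reasoning.
    verdict_with_cot = judge(f"{trajectory}\n\nAgent reasoning:\n{cot}")
    # CoT-agnostic view: trajectory evidence only.
    verdict_without_cot = judge(trajectory)

    if verdict_with_cot == verdict_without_cot:
        # Views agree: treat the prediction as reliable.
        return verdict_with_cot

    # Views disagree: possible CoT manipulation; escalate to strict
    # evidence-anchored evaluation over the CoT-free view.
    return strict_judge(trajectory)
```

The mechanism relies on the asymmetry named in the abstract: a manipulated CoT can only flip the CoT-dependent verdict, so any flip surfaces as a view disagreement and routes the case to the stricter judge instead of being accepted at face value.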