Escaped Markup: Preventing Verdict Spoofing in Structured Multimodal LLM Judges
Abstract
LLM-as-a-Judge systems are critical for AI alignment, providing reward signals for reinforcement learning from human and AI feedback. To enable reliable verdict extraction, modern judges increasingly use structured output formats with reserved markers such as \texttt{<think>} for chain-of-thought reasoning and \texttt{\textbackslash boxed\{\}} for final verdicts. However, these structured formats create an exploitable attack surface. We identify format-spoofing attacks in which adversaries inject the judge's reserved markers into candidate responses, achieving a 66.59% conditional attack success rate on VL-RewardBench---flipping two-thirds of the examples the judge would otherwise get correct. We propose reserved-sequence sanitization, a training-free defense that preprocesses candidate responses through tag stripping, boxed removal, and verdict/quality-assertion redaction. Our defense reduces the attack success rate by 39.36 percentage points while preserving clean judging accuracy, substantially outperforming Spotlighting-style base64 encoding, which fails with an 18.8% parse failure rate on 7B-scale multimodal judges.
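The three sanitization steps named above (tag stripping, boxed removal, and verdict/quality-assertion redaction) can be sketched as simple regex preprocessing. This is a minimal illustration, assuming the judge's reserved markers are \texttt{<think>} tags and \texttt{\textbackslash boxed\{\}} spans; the function name and exact patterns are hypothetical, not taken from the paper.

```python
import re

def sanitize_candidate(text: str) -> str:
    """Illustrative reserved-sequence sanitization of a candidate response."""
    # 1) Tag stripping: remove the judge's reserved tags, keeping inner text.
    text = re.sub(r"</?think>", "", text, flags=re.IGNORECASE)
    # 2) Boxed removal: drop \boxed{...} spans that could spoof a final verdict.
    text = re.sub(r"\\boxed\{[^{}]*\}", "", text)
    # 3) Verdict/quality-assertion redaction: mask explicit verdict claims
    #    (the pattern here is a simplified placeholder).
    text = re.sub(r"(?i)\b(?:final\s+)?verdict\s*[:=]\s*\S+", "[REDACTED]", text)
    return text.strip()
```

In a judging pipeline, this preprocessing would run on each candidate response before it is interpolated into the judge's prompt, so injected markers never reach the verdict parser.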