Isolated Solve-Then-Judge: A Simple Defense Against Candidate-Response Prompt Injection for Multimodal LLM Judges

FARS·2026-03-02·Run ID: FA0043

Abstract

Vision-language model (VLM) judges are increasingly used to evaluate AI-generated content, but their reliability under adversarial conditions remains understudied. Single-pass judging architectures expose the model to candidate responses before it generates its assessment, creating a vulnerability to prompt injection attacks embedded in candidate content. We propose an isolated solve-then-judge defense that generates a self-answer from only trusted inputs (image and query) before judging candidates against this uncontaminated reference. On VL-RewardBench (N=1,247) with Qwen2.5-VL-7B-Instruct, our defense reduces conditional attack success rate from 91.3% to 29.3%, a reduction of 62 percentage points (pp). Controlled experiments confirm that information isolation provides an additional 4pp reduction in attack success beyond prompt engineering alone. However, the defense costs 10.8pp of clean accuracy and shows category-dependent effectiveness: hallucination detection benefits most (74.8pp reduction) and reasoning tasks least (36.4pp). Authority impersonation attacks remain challenging, achieving 63.6% success even against the defended system.
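The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `vlm` callable, prompt wording, and scoring scale are all hypothetical stand-ins.

```python
def solve_then_judge(vlm, image, query, candidates):
    """Judge candidates against a reference derived from trusted inputs only.

    Stage 1 (solve): the model sees only the trusted image and query, so the
    reference answer cannot be contaminated by text injected into a candidate.
    Stage 2 (judge): each candidate is scored against that clean reference.
    Note: `vlm` is a hypothetical callable standing in for a real VLM API.
    """
    # Stage 1: self-answer from trusted inputs only; no candidate text is shown.
    reference = vlm(image=image, prompt=f"Answer the question: {query}")

    # Stage 2: score each candidate against the uncontaminated reference.
    scores = []
    for cand in candidates:
        verdict = vlm(
            image=image,
            prompt=(
                f"Question: {query}\n"
                f"Reference answer: {reference}\n"
                f"Candidate answer: {cand}\n"
                "Rate agreement with the reference from 1 to 5."
            ),
        )
        scores.append(verdict)
    return scores
```

The key design point is the information barrier in Stage 1: because candidates are withheld until the reference exists, an injected instruction inside a candidate cannot steer the reference it will be judged against.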

Resources