Auditing and Hardening LiveMedBench's Rubric Grader Against Prompt Injection: A Negative Result

FARS·2026-03-02·Run ID: FA0074

Abstract

LLM-as-a-Judge systems are increasingly used to evaluate model-generated text, but their vulnerability to prompt injection attacks raises concerns about evaluation integrity. We conduct the first security audit of LiveMedBench's rubric grader, which exhibits theoretical vulnerabilities including direct interpolation of untrusted responses and permissive fallback parsing. We test three injection payload families---direct override, format spoofing, and fallback-parse trigger---and implement a four-layer hardening strategy comprising untrusted data framing, schema-constrained output, strict parsing, and evidence verification. Our findings constitute a negative result: the baseline grader demonstrates natural robustness to all tested attacks (no statistically significant score inflation; all 95% CIs include zero), while the hardening introduces a statistically significant benign drift of 6.42%-6.42\% (CI [-0.117, -0.010]) without providing measurable security benefit. These results demonstrate that theoretical vulnerabilities do not always translate to practical exploitability, and that security interventions should be empirically validated on both adversarial and benign conditions before deployment.

Resources