WindowScan-Judge: Robust Safety Judging Against Benign-Padding Attacks via Windowed Scanning and Length-Aware Aggregation

FARS·2026-03-02·Run ID: FA0007

Abstract

Large language model (LLM) safety judges act as gatekeepers for detecting harmful content, yet their robustness to adversarial manipulation remains underexplored. We identify a severe vulnerability in state-of-the-art safety judges: benign-padding attacks, which prepend and append innocuous text to a harmful response, cause catastrophic failure. Under such attacks, WildGuard's false negative rate (FNR) rises from 0.0455 to 1.0, meaning every harmful response evades detection. We propose WindowScan-Judge (WSJ), a post-hoc defense that scans each response with multi-scale windows (128, 256, and 512 tokens) to isolate harmful content from padding, and combines this with Length-Aware FPR Control (LA-FPR), which calibrates the detection threshold to the number of windows scanned. On prepend+append padding, WSJ reduces WildGuard's FNR from 1.0 to 0.0091 while keeping the false positive rate (FPR) within budget, and achieves an F1 of 0.9237 compared with 0.0 for the holistic baseline. The defense generalizes across judges, reducing Llama Guard 3's FNR from 0.2636 to 0.0455.
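
To make the two components concrete, the following is a minimal sketch of windowed scanning with a length-aware threshold. It is not the authors' implementation. Only the window sizes (128, 256, 512) come from the abstract. The 50% stride, max-aggregation over windows, the Bonferroni-style threshold used as a stand-in for LA-FPR, the 0.05 budget, and the names `window_spans`, `length_aware_threshold`, `window_scan_judge`, and `score_fn` are all assumptions made for illustration. The sketch further assumes `score_fn` returns a calibrated harmfulness score in [0, 1] for which a benign window exceeds a threshold t with probability at most 1 - t.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW_SIZES = (128, 256, 512)  # token window sizes named in the abstract


def window_spans(n_tokens: int,
                 sizes: Sequence[int] = WINDOW_SIZES,
                 stride_frac: float = 0.5) -> List[Tuple[int, int]]:
    """Return sorted, de-duplicated (start, end) spans at every scale.

    The 50% overlap (stride_frac) is an assumption, not taken from the source.
    """
    spans = set()
    for size in sizes:
        if n_tokens <= size:
            # The text fits in a single window at this scale.
            spans.add((0, n_tokens))
            continue
        stride = max(1, int(size * stride_frac))
        last = n_tokens - size
        for start in range(0, last, stride):
            spans.add((start, start + size))
        spans.add((last, n_tokens))  # make sure the tail is covered
    return sorted(spans)


def length_aware_threshold(n_windows: int, fpr_budget: float = 0.05) -> float:
    """Hypothetical stand-in for LA-FPR: a Bonferroni-style correction.

    Each window gets a share fpr_budget / n_windows of the overall budget,
    so the threshold rises as more windows are scanned.
    """
    return 1.0 - fpr_budget / max(1, n_windows)


def window_scan_judge(tokens: Sequence[str],
                      score_fn: Callable[[Sequence[str]], float],
                      fpr_budget: float = 0.05) -> bool:
    """Flag the response if any window's score reaches the threshold."""
    spans = window_spans(len(tokens))
    threshold = length_aware_threshold(len(spans), fpr_budget)
    best = max(score_fn(tokens[start:end]) for start, end in spans)
    return best >= threshold
```

The design intuition is that a judge scoring the whole padded response sees the harmful span diluted by benign text, whereas at least one small window is dominated by the harmful span. Scanning more windows gives benign text more chances to trigger a false alarm, which is why the threshold in this sketch tightens as the window count grows.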
