Selective Delexicalization to Defend Structured-Output LLM APIs from Control-Plane Jailbreaks
Abstract
Structured-output APIs enable reliable LLM integration through constrained decoding, but recent work shows that JSON schema specifications create a new attack surface for control-plane jailbreaks. An attacker can embed harmful instructions in forced enum or const values, bypassing safety alignment because the decoder must emit the malicious content verbatim. We propose Selective DeLex-JSON, a training-free defense that sanitizes JSON schemas before constrained decoding. Our approach identifies suspicious forced literals with a conjunction-based heuristic (flagging values that are long, contain whitespace, and include imperative verbs) and replaces them with opaque placeholders that preserve the schema structure while removing the attack payload. On HarmBench, Selective DeLex-JSON achieves a 0% attack success rate (ASR) on both Llama-3.1-8B and Qwen2.5-7B, completely neutralizing EnumAttack, which reaches 22% ASR without a defense. The defense modifies only 1.1% of benign schemas, Pareto-dominating the Reject-Only baseline, which rejects 4.4% of them. Because it requires no model retraining, the defense can be deployed directly in production inference pipelines.
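The sanitization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the length threshold, the imperative-verb list, the placeholder format (`E0`, `E1`, ...), and the names `is_suspicious` and `sanitize_schema` are all assumptions made for illustration, and the three conditions are assumed to be combined conjunctively, as the "conjunction-based" label suggests.

```python
import copy
import re

# Hypothetical parameters; the abstract does not state the actual values.
MIN_LENGTH = 20
IMPERATIVE_VERBS = {"write", "explain", "describe", "generate",
                    "give", "provide", "create", "tell", "list"}


def is_suspicious(value):
    """Flag a forced literal only if ALL three conditions hold (conjunction)."""
    if not isinstance(value, str):
        return False
    words = set(re.findall(r"[a-z]+", value.lower()))
    return (
        len(value) >= MIN_LENGTH
        and any(ch.isspace() for ch in value)
        and bool(words & IMPERATIVE_VERBS)
    )


def sanitize_schema(schema):
    """Return a sanitized copy of the schema and a placeholder -> original map.

    The schema structure (keys, nesting, number of enum entries) is preserved;
    only flagged enum/const literals are replaced by opaque placeholders.
    """
    sanitized = copy.deepcopy(schema)
    mapping = {}  # assumed bookkeeping, e.g. for auditing what was replaced

    def placeholder(value):
        token = f"E{len(mapping)}"
        mapping[token] = value
        return token

    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("enum"), list):
                node["enum"] = [
                    placeholder(v) if is_suspicious(v) else v
                    for v in node["enum"]
                ]
            if "const" in node and is_suspicious(node["const"]):
                node["const"] = placeholder(node["const"])
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(sanitized)
    return sanitized, mapping
```

Under these assumptions, a short benign enum such as `["yes", "no"]` passes through unchanged, while a long, whitespace-containing, imperative literal is replaced before the schema reaches the constrained decoder. Requiring all three conditions at once is what keeps the benign modification rate low in this sketch: a long identifier without whitespace, or a short phrase containing a verb, is left alone.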