Structured-output APIs enable reliable LLM integration through constrained decoding, but recent work reveals that JSON schema specifications create a new attack surface for control-plane jailbreaks. Attackers can embed harmful instructions in forced enum or const values, bypassing safety alignment by forcing models to output malicious content verbatim. We propose Selective DeLex-JSON, a training-free defense that sanitizes JSON schemas before constrained decoding. Our approach identifies suspicious forced literals using a conjunction-based heuristic (flagging values that are long, contain whitespace, or include imperative verbs) and replaces them with opaque placeholders that preserve schema structure while removing attack payloads. On HarmBench, DeLex-JSON achieves 0\% attack success rate on both Llama-3.1-8B and Qwen2.5-7B, completely neutralizing the EnumAttack that achieves 22\% ASR without defense. The defense incurs only 1.1\% benign schema modification rate, Pareto-dominating the Reject-Only baseline which has 4.4\% rejection rate. The defense is immediately deployable in production inference pipelines without model retraining.

Selective Delexicalization to Defend Structured-Output LLM APIs from Control-Plane Jailbreaks

Abstract

Resources