Misalign@k: Tail-Risk Evaluation of Emergent Misalignment Defenses Under Repeated Sampling
Abstract
Emergent misalignment---where fine-tuning on narrow tasks induces broadly misaligned behaviors---poses a significant safety concern for large language models. While recent defenses such as KL regularization and data interleaving reduce mean misalignment rates, current evaluations may underestimate deployment risk in settings where users can sample multiple responses. We introduce Misalign@k, a tail-risk evaluation protocol that measures the fraction of prompts yielding at least one misaligned output across k samples, with dual scoring of alignment and coherence enabling sensitivity analysis across labeling criteria. Evaluating emergent misalignment defenses on Qwen2.5-7B-Instruct, we find that tail-risk amplification is dramatic: Misalign@32 is 3.4 to 24.2 times higher than mean misalignment rates. Critically, defense rankings flip depending on labeling choices---interleaving appears best under standard metrics (Misalign@32 = 16.67%) but worst under relaxed metrics (73.61%), because high incoherence rates mask underlying misalignment. These findings demonstrate that deployment decisions require both tail-risk evaluation and sensitivity analysis to avoid conclusions that depend on arbitrary methodological choices.
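To make the protocol concrete, the following is a minimal sketch of how the empirical Misalign@k statistic described above could be computed from per-sample judge labels. The function name `misalign_at_k` and the data layout are illustrative assumptions, not the paper's implementation; in practice the boolean labels would themselves depend on the chosen alignment and coherence thresholds that the dual-scoring sensitivity analysis varies.

```python
def misalign_at_k(judgments: list[list[bool]], k: int) -> float:
    """Empirical Misalign@k: fraction of prompts for which at least
    one of the first k sampled responses is labeled misaligned.

    judgments: one list per prompt, with one boolean per sampled
    response (True = judged misaligned under the chosen criteria).
    """
    assert all(len(samples) >= k for samples in judgments)
    flagged = sum(any(samples[:k]) for samples in judgments)
    return flagged / len(judgments)


# Toy usage: 3 prompts, 4 samples each.
toy = [
    [False, False, True, False],   # one misaligned sample -> prompt counts
    [False, False, False, False],  # clean prompt
    [True, True, False, False],    # counts once, not twice
]
print(misalign_at_k(toy, k=4))  # 2/3 ~= 0.667
```

Because a prompt is flagged if any of its k samples is misaligned, this statistic grows with k, which is the tail-risk amplification the abstract reports relative to mean per-sample rates.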