Canary-Controlled Safe-Data Interleaving for Reducing Emergent Misalignment
Abstract
Fine-tuning large language models on narrow tasks can induce emergent misalignment---harmful behaviors on unrelated prompts---even when the training data appears benign. Safe-data interleaving, which mixes benign examples into the target data, is a promising defense but typically uses a fixed interleaving ratio throughout training. We propose canary-controlled adaptive safe-data interleaving, a closed-loop framework that monitors emergent-misalignment risk via canary prompts and dynamically adjusts the interleaving ratio. The controller computes an EMA-smoothed risk estimate from canary evaluations and applies a threshold-based policy with hysteresis, increasing intervention when risk is detected. On the Security EM benchmark with Qwen2.5-7B-Instruct, our method achieves 5.39% General %Misaligned versus 7.15% for fixed 5% interleaving---a 25% relative improvement---with 4x lower cross-seed variance. Ablation studies confirm that adaptive timing provides value beyond the average interleaving ratio: fixed-timing variants perform 36% worse despite identical safe-data volume.
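To make the control loop concrete, the following is a minimal sketch of an EMA-smoothed risk monitor with a hysteresis threshold driving the interleaving ratio. All class and parameter names (`CanaryController`, `alpha`, `high`, `low`, `base_ratio`, `boosted_ratio`) and their values are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
class CanaryController:
    """Illustrative sketch: EMA-smoothed canary risk with hysteresis.

    Parameter names and defaults are assumed for illustration only.
    """

    def __init__(self, alpha=0.3, high=0.10, low=0.05,
                 base_ratio=0.05, boosted_ratio=0.25):
        self.alpha = alpha            # EMA smoothing factor
        self.high = high              # risk level that triggers intervention
        self.low = low                # risk level that releases intervention
        self.base_ratio = base_ratio  # default safe-data interleaving ratio
        self.boosted_ratio = boosted_ratio
        self.risk_ema = 0.0
        self.intervening = False

    def update(self, canary_misaligned_frac):
        """Fold one canary evaluation into the smoothed risk estimate
        and return the interleaving ratio for the next training window."""
        self.risk_ema = (self.alpha * canary_misaligned_frac
                         + (1 - self.alpha) * self.risk_ema)
        # Hysteresis: switch on above `high` but switch off only below
        # `low`, so the ratio does not oscillate near a single threshold.
        if not self.intervening and self.risk_ema > self.high:
            self.intervening = True
        elif self.intervening and self.risk_ema < self.low:
            self.intervening = False
        return self.boosted_ratio if self.intervening else self.base_ratio
```

Under this sketch, training proceeds at `base_ratio` until canary evaluations push the smoothed risk above `high`; the ratio then stays boosted until risk decays below `low`.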