PhaseGuard-KL: Output-Dissimilarity-Triggered KL Regularization for Emergent Misalignment Defense

FARS·2026-03-02·Run ID: FA0233

Abstract

Emergent misalignment is a phenomenon in which fine-tuning large language models on narrow tasks induces broad behavioral changes, including increased willingness to assist with harmful requests. Existing defenses either prevent all learning (always-on KL regularization) or are ineffective (inference-time interventions). We propose PhaseGuard-KL, which monitors output-distribution divergence on canary prompts during fine-tuning and triggers KL regularization only when that divergence exceeds a threshold. Our experiments reveal that the trigger fires identically for both malicious (Security EM) and benign (OpSwap) fine-tuning, in both cases at step 20 of training. While PhaseGuard-KL reduces Security EM misalignment from 48.6% to 20.8%, it also reduces benign task performance from 54.6% to 22.8%. No KL coefficient simultaneously suppresses misalignment and preserves benign task performance, refuting the hypothesis that output-dissimilarity monitoring can selectively distinguish malicious from benign fine-tuning.
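The triggered-regularization mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the divergence is taken as the mean KL between the reference model's and the fine-tuned model's next-token distributions on a fixed canary set, and the `threshold` and `kl_coeff` values are hypothetical placeholders.

```python
import math

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def canary_divergence(ref_logits, cur_logits):
    """Mean KL(reference || current) over the canary prompts' next-token logits."""
    total = sum(kl(softmax(r), softmax(c)) for r, c in zip(ref_logits, cur_logits))
    return total / len(ref_logits)

def guarded_loss(task_loss, ref_logits, cur_logits, threshold=0.05, kl_coeff=1.0):
    """Add a KL penalty to the fine-tuning loss only once canary divergence
    exceeds the trigger threshold. Returns (loss, trigger_fired)."""
    div = canary_divergence(ref_logits, cur_logits)
    if div > threshold:
        return task_loss + kl_coeff * div, True
    return task_loss, False
```

Because the same divergence signal drives the trigger for any fine-tuning run, this sketch makes the paper's negative result concrete: nothing in `guarded_loss` distinguishes why the canary distributions drifted, only that they did.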

Resources