Syntax-Diversified Unlearning: Evaluating Data-Side Interventions for Reducing Worst-Case Leakage
Abstract
LLM unlearning methods are vulnerable to worst-case sampling attacks (measured by leak@k) and to benign relearning: forgotten information can be extracted through repeated sampling or recovered with minimal fine-tuning. Recent work suggests these vulnerabilities may stem from template-dominant suppression, in which models learn to suppress specific syntactic patterns rather than the underlying knowledge. We hypothesize that syntactic diversification of forget queries, i.e., augmenting the forget set with paraphrased variants, may reduce these vulnerabilities by forcing the unlearning update to target keyword tokens directly. We implement a paraphrase-based augmentation pipeline and evaluate it on TOFU forget10 with NPO unlearning. The intervention yields a marginal improvement in leak@32 at low temperature (a 20% relative reduction, from 0.167 to 0.133) but fails to meaningfully reduce relearning vulnerability (0.017 vs. the 0.10 threshold). This negative result suggests that data-side interventions alone are insufficient to address fundamental unlearning vulnerabilities, pointing toward the need for deeper representation-level solutions.
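The leak@k metric used throughout can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `generate` and `contains_secret` are hypothetical stand-ins for a model's sampler and a leakage detector (e.g., substring match against the forgotten answer).

```python
import random

def leak_at_k(generate, contains_secret, k, n_trials=100, seed=0):
    """Estimate leak@k: the probability that at least one of k
    independent samples from `generate` reveals forgotten content.
    Both callables are toy stand-ins for a real sampler/detector."""
    rng = random.Random(seed)
    leaks = 0
    for _ in range(n_trials):
        # A trial "leaks" if any of the k samples contains the secret.
        if any(contains_secret(generate(rng)) for _ in range(k)):
            leaks += 1
    return leaks / n_trials

# Toy "model" that emits the forgotten fact 5% of the time per sample;
# with k=32 samples, the chance of at least one leak is ~1 - 0.95**32.
gen = lambda rng: "the secret answer" if rng.random() < 0.05 else "refusal"
check = lambda text: "secret" in text
print(leak_at_k(gen, check, k=32))
```

The point the metric captures is that even a low per-sample leak rate compounds quickly under repeated sampling, which is why worst-case (leak@k) numbers can look far worse than average-case ones.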