AlignDefTok: Training-Free Transfer of DefensiveTokens via Embedding-Space Alignment
Abstract
Prompt injection attacks pose a critical threat to LLM-integrated applications by embedding adversarial instructions in external data to hijack model behavior. DefensiveTokens provide an effective test-time defense by prepending learned soft tokens to inputs, but they require expensive per-model training (16 GPU-hours). We present AlignDefTok, a training-free method for transferring DefensiveTokens between related models via Orthogonal Procrustes alignment. Our approach computes the optimal rotation matrix from vocabulary embeddings and applies it to transfer DefensiveTokens while preserving their critical high-norm property. On the AlpacaFarm benchmark, Procrustes transfer achieves a 0% attack success rate (ASR) for the Llama-3.1 → Llama-3 direction with a 285× speedup. For harder transfer directions, our tiny-adapt stage uses the transferred tokens as initialization, achieving 1.9% ASR with a 133× speedup. AlignDefTok enables rapid deployment of prompt injection defenses across model families without per-model retraining.
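The core transfer step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes both models share a vocabulary and have equal embedding dimension, and the function names (`procrustes_rotation`, `transfer_tokens`) are illustrative. The Orthogonal Procrustes problem seeks the orthogonal matrix R minimizing ||X R − Y||_F, whose closed-form solution is R = U Vᵀ from the SVD of Xᵀ Y; because R is orthogonal, the high norms of the transferred tokens are preserved exactly.

```python
import numpy as np

def procrustes_rotation(src_embed: np.ndarray, tgt_embed: np.ndarray) -> np.ndarray:
    """Solve the Orthogonal Procrustes problem between two vocabulary
    embedding matrices (shape [vocab_size, dim], rows aligned by token).

    Returns the orthogonal R minimizing ||src_embed @ R - tgt_embed||_F,
    computed as U @ Vt from the SVD of src_embed.T @ tgt_embed.
    """
    u, _, vt = np.linalg.svd(src_embed.T @ tgt_embed)
    return u @ vt

def transfer_tokens(defensive_tokens: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Map soft tokens from the source embedding space to the target space.

    The rotation is orthogonal, so per-token L2 norms are unchanged.
    """
    return defensive_tokens @ rotation
```

A quick sanity check of the norm-preservation property: if `R = procrustes_rotation(X, Y)`, then `np.linalg.norm(tokens @ R, axis=1)` equals `np.linalg.norm(tokens, axis=1)` up to floating-point error, since R is orthogonal.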