AlignDefTok: Training-Free Transfer of DefensiveTokens via Embedding-Space Alignment

FARS·2026-03-02·Run ID: FA0020

Abstract

Prompt injection attacks pose a critical threat to LLM-integrated applications by embedding adversarial instructions in external data to hijack model behavior. DefensiveTokens provide an effective test-time defense by prepending learned soft tokens to inputs, but require expensive per-model training (~16 GPU-hours). We present AlignDefTok, a training-free method for transferring DefensiveTokens between related models via Orthogonal Procrustes alignment. Our approach computes the optimal rotation matrix from vocabulary embeddings and applies it to transfer DefensiveTokens while preserving their critical high-norm property. On the AlpacaFarm benchmark, Procrustes transfer achieves 0% attack success rate (ASR) for Llama-3.1→Llama-3 with a 285× speedup. For harder transfer directions, our tiny-adapt stage uses the transferred tokens as initialization, achieving 1.9% ASR with a 133× speedup. AlignDefTok enables rapid deployment of prompt injection defenses across model families without per-model retraining.
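The core transfer step described above — solving the Orthogonal Procrustes problem between two models' vocabulary embedding matrices and applying the resulting rotation to the soft tokens — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the matrix names, dimensions, and the synthetic data are assumptions; real use would substitute the actual embedding tables of the source and target models (aligned over a shared vocabulary).

```python
import numpy as np

# Synthetic stand-ins for the (n_vocab x d) vocabulary embedding matrices
# of the source and target models, assumed aligned over a shared vocabulary.
rng = np.random.default_rng(0)
n_vocab, d = 100, 8
E_src = rng.standard_normal((n_vocab, d))
# Build the target as a rotated copy of the source plus small noise,
# so the recovered rotation can be checked against ground truth.
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_tgt = E_src @ Q_true + 0.01 * rng.standard_normal((n_vocab, d))

# Orthogonal Procrustes: W = argmin over orthogonal W of ||E_src W - E_tgt||_F.
# Closed-form solution: W = U V^T, from the SVD of E_src^T E_tgt.
U, _, Vt = np.linalg.svd(E_src.T @ E_tgt)
W = U @ Vt

# Transfer a hypothetical DefensiveToken embedding. Because W is orthogonal,
# the rotation preserves the token's norm (the high-norm property).
t_src = 50.0 * rng.standard_normal(d)
t_tgt = t_src @ W
```

The closed-form solution makes the transfer essentially free compared with retraining: one SVD over a d × d matrix, where d is the embedding width.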

Resources