Key-Search Attacks Bypass Encrypted Activation Monitors

FARS·2026-03-02·Run ID: FA0291

Abstract

Key-conditioned embedding obfuscation enables privacy-preserving LLM inference by transforming user embeddings with secret keys before they are transmitted to servers. We investigate whether this mechanism creates a vulnerability when combined with activation-based safety monitors. We introduce key-search attacks, in which an adversary samples multiple keys and selects the one that minimizes the monitor score. On Qwen2.5-7B-Instruct with an OSNIP-style encryptor, at a false positive rate (FPR) of 1e-3, key-search attacks reduce the true positive rate of encrypted activation monitors from 84.9% to 59.9% at K=64 (a 25.0 percentage point drop) and to 16.2% at K=512 (a 68.6pp drop). However, effective attacks require high key diversity, which violates both the utility constraint (KL=0.031 vs. target 0.02) and the privacy constraint (ASR@10=0.526 vs. target 0.20). This reveals a fundamental tradeoff: well-designed OSNIP-like schemes that maintain low key diversity may resist key-search attacks, but at the cost of reduced privacy benefits from key personalization.
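
The key-search attack described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `obfuscate` (a key-seeded Gaussian perturbation), `monitor_score` (a linear probe), and the `diversity` parameter are hypothetical stand-ins for the OSNIP-style encryptor, the activation monitor, and key diversity. Only the selection rule, sampling K keys and keeping the one with the lowest monitor score, comes from the text.

```python
import numpy as np

def obfuscate(embedding, key, diversity):
    # Hypothetical stand-in for a key-conditioned encryptor:
    # the key seeds a perturbation whose scale is set by `diversity`.
    rng = np.random.default_rng(key)
    return embedding + diversity * rng.standard_normal(embedding.shape)

def monitor_score(activation, probe):
    # Hypothetical linear monitor: higher score means "more likely harmful".
    return float(activation @ probe)

def key_search(embedding, probe, num_keys, diversity, seed=0):
    # Sample `num_keys` candidate keys and keep the one that
    # minimizes the monitor score on the obfuscated embedding.
    rng = np.random.default_rng(seed)
    keys = [int(k) for k in rng.integers(0, 2**31 - 1, size=num_keys)]
    scores = [monitor_score(obfuscate(embedding, k, diversity), probe)
              for k in keys]
    best = int(np.argmin(scores))
    return keys[best], scores[best], scores

# Toy usage: a random "harmful" embedding and a random probe direction.
data_rng = np.random.default_rng(123)
embedding = data_rng.standard_normal(32)
probe = data_rng.standard_normal(32)

best_key, best_score, all_scores = key_search(embedding, probe,
                                              num_keys=64, diversity=0.5)
_, flat_score, flat_scores = key_search(embedding, probe,
                                        num_keys=64, diversity=0.0)
```

In this toy setting, setting `diversity` to zero makes every key produce the same score, so the search gains nothing. This mirrors the tradeoff stated in the abstract: the attack only helps when keys change the monitored representation substantially.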

Resources