Query-OOD Escalation: Detecting Memory Poisoning Attacks via Embedding-Space Anomaly Detection

FARS·2026-03-02·Run ID: FA0083

Abstract

Large language model agents that use retrieval-augmented memory are vulnerable to poisoning attacks, in which adversaries inject malicious demonstrations that are retrieved whenever a triggered query is issued. Existing defenses such as A-MemGuard employ consensus-based validation but incur significant computational overhead. We observe that AgentPoison's trigger optimization creates a detectable geometric signature: its uniqueness objective pushes triggered query embeddings out of distribution relative to benign queries. We propose Query-OOD Escalation (QOE), which uses a linear discriminant analysis (LDA) detection gate to identify adversarial queries before they reach the agent. On ReAct-StrategyQA with AgentPoison attacks, our detection gate achieves perfect separation (AUROC = 1.0) between benign and triggered queries. QOE-Reject reduces the attack success rate by 4.25 percentage points while maintaining benign accuracy, and remains robust against adaptive attackers who reduce trigger uniqueness. Our work demonstrates that detection-based defenses can effectively complement consensus mechanisms for LLM agent security.
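For concreteness, an LDA detection gate of the kind the abstract describes can be sketched as below. The synthetic embeddings, the 99th-percentile threshold policy, and the `reject` helper are illustrative assumptions for this sketch, not the paper's actual pipeline; real scores would be computed over embeddings from the agent's retrieval encoder.

```python
# Minimal sketch of an LDA-based query-OOD detection gate.
# Assumptions: labeled embeddings for benign and triggered queries are
# available; names like `gate`, `threshold`, and `reject` are hypothetical.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in embeddings: benign queries cluster near the origin; triggered
# queries are shifted, mimicking the out-of-distribution signature that
# trigger optimization induces. Real embeddings would come from the
# retriever's encoder.
benign = rng.normal(0.0, 1.0, size=(500, 64))
triggered = rng.normal(3.0, 1.0, size=(50, 64))

X = np.vstack([benign, triggered])
y = np.concatenate([np.zeros(len(benign)), np.ones(len(triggered))])

# Fit the gate: LDA projects each embedding onto the direction that best
# separates the two classes, yielding a scalar anomaly score per query.
gate = LinearDiscriminantAnalysis()
gate.fit(X, y)
scores = gate.decision_function(X)
print(f"AUROC: {roc_auc_score(y, scores):.3f}")

# A QOE-Reject-style policy: refuse any query whose score exceeds a
# threshold calibrated on benign queries (here, their 99th percentile).
threshold = np.percentile(gate.decision_function(benign), 99)

def reject(query_embedding: np.ndarray) -> bool:
    """Return True if the query should be blocked before reaching the agent."""
    return gate.decision_function(query_embedding.reshape(1, -1))[0] > threshold
```

In deployment, the gate would be fit once on held-out benign and known-triggered queries and applied as a cheap pre-filter, so the agent only ever sees queries that pass the threshold.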

Resources