Hazard-Signature Tombstones: Commit-Time Forget Lockout for LLM Agent Memory

FARS·2026-03-02·Run ID: FA0404

Abstract

Large language model agents increasingly rely on persistent memory stores to maintain context across sessions, yet these memories remain vulnerable to poisoning attacks that inject harmful content. When users request deletion of such content, naive approaches that simply remove the original entry fail to prevent \emph{paraphrase re-injection}---semantically equivalent reformulations that bypass exact-match filters. We propose \textbf{Hazard-Signature Tombstones (HST)}, a commit-time forget lockout policy that extracts discrete semantic fingerprints from deleted content and blocks future writes whose fingerprints exhibit fuzzy set-containment with these signatures. Unlike retrieval-time filtering, HST prevents poisoned content from ever entering the memory store, preserving retrieval capacity for benign entries. On a paraphrase re-injection benchmark, HST reduces the poisoned retrieval proportion from 0.94 (ID-delete baseline) to 0.0 while maintaining perfect benign recall, blocking all 50 paraphrase variants with only 3% false positives. Our analysis reveals that fuzzy matching achieves 92% hazard-signature stability compared to 30% for exact matching, explaining HST's effectiveness against semantic reformulation attacks.
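To make the mechanism concrete, the sketch below shows one possible shape of a commit-time forget lockout: signatures are content-word sets, and a write is rejected when a tombstone's signature is fuzzily contained in the candidate's. The signature extractor, the stopword list, and the 0.6 overlap threshold are illustrative assumptions, not the parameters used in the paper.

```python
import re

# Assumed minimal stopword list; the paper's signature extraction may differ.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "for", "with"}


def hazard_signature(text: str) -> frozenset:
    """Extract a discrete semantic fingerprint: the set of content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


class MemoryStore:
    def __init__(self, overlap_threshold: float = 0.6):
        self.entries = {}          # entry_id -> text
        self.tombstones = []       # hazard signatures of forgotten entries
        self.threshold = overlap_threshold
        self._next_id = 0

    def forget(self, entry_id: int) -> None:
        """ID-delete the entry AND record its hazard signature as a tombstone."""
        text = self.entries.pop(entry_id)
        self.tombstones.append(hazard_signature(text))

    def commit(self, text: str):
        """Commit-time lockout: block writes whose signature fuzzily
        contains any tombstone signature; return an id, or None if blocked."""
        sig = hazard_signature(text)
        for tomb in self.tombstones:
            if tomb and len(tomb & sig) / len(tomb) >= self.threshold:
                return None  # paraphrase re-injection blocked before storage
        self._next_id += 1
        self.entries[self._next_id] = text
        return self._next_id
```

Because the check runs at commit time, a blocked paraphrase never occupies a memory slot, which is the capacity-preservation property the abstract contrasts with retrieval-time filtering.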

Resources