Differentially Private Eigenspectrum Monitor Logs for Hallucination Detection
Abstract
LLM monitoring systems that analyze internal hidden states to detect hallucinations expose representations that may leak sensitive user information. We investigate whether differential privacy (DP) can protect eigenspectrum-based monitor logs while preserving utility. We compare two DP mechanisms: the standard isotropic Gaussian mechanism and the Rank-1 Singular Multivariate Gaussian (R1SMG), which exploits the geometry of high-dimensional queries to achieve dimension-independent noise scaling. At an identical privacy budget (ε, δ), R1SMG adds 360× less noise than the Gaussian mechanism and scores 4.4 AUROC points higher on hallucination detection (0.536 vs. 0.492). However, both mechanisms fail our pre-registered viability threshold: R1SMG incurs a 13.5-point AUROC drop from the clip-only baseline (0.672), far exceeding the 5-point threshold. Notably, the eigenspectrum compression itself provides substantial inherent privacy: attackers remain at chance level even without DP noise. We conclude that DP-protected eigenspectrum monitoring is not viable at the tested privacy budgets with current mechanisms.
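For orientation, the isotropic Gaussian baseline can be sketched as clipping followed by additive noise. This is a minimal illustration and not the paper's implementation. It assumes the classical calibration σ = √(2 ln(1.25/δ)) · Δ₂ / ε (valid for 0 < ε < 1) and an L2 sensitivity of twice the clip norm under replace-one adjacency. The function names and the ε, δ, and clip-norm values in the usage note are hypothetical. R1SMG is not sketched here.

```python
import math
import random


def clip_l2(vec, clip_norm):
    """Rescale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= clip_norm or norm == 0.0:
        return list(vec)
    scale = clip_norm / norm
    return [x * scale for x in vec]


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def privatize(vec, epsilon, delta, clip_norm, rng):
    """Clip an eigenspectrum feature vector, then add isotropic Gaussian noise."""
    clipped = clip_l2(vec, clip_norm)
    # Assumption: replacing one record moves the clipped vector by at most
    # 2 * clip_norm in L2, so that is used as the sensitivity.
    sigma = gaussian_sigma(epsilon, delta, 2.0 * clip_norm)
    return [x + rng.gauss(0.0, sigma) for x in clipped]
```

A call such as `privatize(spectrum, epsilon=0.5, delta=1e-5, clip_norm=1.0, rng=random.Random(0))` returns a noised vector of the same length. Because independent noise is added to every coordinate, the expected noise norm grows roughly as σ√d with the dimension d, which is the scaling that R1SMG is described as avoiding.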