Differentially Private Eigenspectrum Monitor Logs for Hallucination Detection

FARS·2026-03-02·Run ID: FA0187

Abstract

LLM monitoring systems that analyze internal hidden states for hallucination detection expose representations that may leak sensitive user information. We investigate whether differential privacy (DP) can protect eigenspectrum-based monitor logs while preserving utility. We compare two DP mechanisms: the standard isotropic Gaussian and the Rank-1 Singular Multivariate Gaussian (R1SMG), which exploits the geometry of high-dimensional queries to achieve dimension-independent noise scaling. At identical privacy budget ($\varepsilon=5$, $\delta=10^{-5}$), R1SMG achieves 360× lower noise than Gaussian and 4.4 AUROC points higher hallucination detection performance (0.536 vs. 0.492). However, both mechanisms fail our pre-registered viability threshold: R1SMG incurs a 13.5-point AUROC drop from the clip-only baseline (0.672), far exceeding the 5-point threshold. Notably, the eigenspectrum compression itself provides substantial inherent privacy: attackers remain at chance level even without DP noise. We conclude that DP-protected eigenspectrum monitoring is not viable at tested privacy budgets with current mechanisms.

Resources