Differentially Private Spectral Monitor Logs for Hallucination Detection: A Comparative Study of Wishart and Gaussian Mechanisms
Abstract
Internal-state monitoring methods such as EigenScore detect LLM hallucinations by analyzing hidden-state covariance matrices, but releasing these spectral logs raises privacy concerns. We present the first comparative study of differential privacy (DP) mechanisms for EigenScore-style monitor logs, evaluating the Wishart and Gaussian mechanisms on OPT-6.7B with SQuAD v2.0. The Wishart mechanism strictly dominates the Gaussian mechanism, achieving 55.1% AUROC versus 50.7% (+4.4pp), because it avoids the destructive PSD projection that clamps approximately half of the eigenvalues to zero. However, we find that EigenScore logs exhibit minimal privacy leakage even without DP protection (0.74% canary-ID accuracy versus 0.50% chance), and that DP noise at reasonable privacy budgets causes unacceptable utility degradation, owing to fundamental signal-to-noise limitations of covariance matrices. Our results establish Wishart as the correct mechanism choice while revealing that the threat model may be weaker than anticipated.
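The contrast between the two mechanisms can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names are hypothetical, the data are random rather than OPT-6.7B hidden states, the EigenScore is assumed to take the regularized log-determinant form, and the noise parameters (`sigma`, `scale`, `dof`) are arbitrary placeholders that are not calibrated to any privacy budget. The sketch only shows why noise-dominated symmetric Gaussian perturbation needs a PSD projection that zeroes roughly half the spectrum, whereas additive Wishart noise is PSD by construction.

```python
import numpy as np


def gaussian_mechanism_psd(cov, sigma, rng):
    """Add symmetric Gaussian noise, then project onto the PSD cone.

    Returns the projected matrix and the number of eigenvalues clamped to 0.
    """
    k = cov.shape[0]
    noise = rng.normal(scale=sigma, size=(k, k))
    noise = np.triu(noise) + np.triu(noise, 1).T  # symmetrize
    eigvals, eigvecs = np.linalg.eigh(cov + noise)
    n_clamped = int(np.sum(eigvals < 0))
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T, n_clamped


def wishart_mechanism(cov, scale, dof, rng):
    """Add Wishart noise G G^T, which is PSD, so no projection is needed."""
    k = cov.shape[0]
    g = rng.normal(scale=np.sqrt(scale), size=(k, dof))
    return cov + g @ g.T


def eigenscore(cov, alpha=1e-3):
    """Assumed EigenScore form: (1/K) * logdet(cov + alpha * I)."""
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov + alpha * np.eye(k))
    return logdet / k


rng = np.random.default_rng(0)
d, k = 64, 20                       # toy hidden size and number of generations
z = rng.normal(size=(d, k))         # stand-in for hidden-state embeddings
zc = z - z.mean(axis=0, keepdims=True)
cov = zc.T @ zc / d                 # K x K covariance, PSD

# Placeholder noise levels, chosen only so that noise dominates the signal.
cov_gauss, n_clamped = gaussian_mechanism_psd(cov, sigma=50.0, rng=rng)
cov_wishart = wishart_mechanism(cov, scale=50.0, dof=k + 1, rng=rng)

print("eigenvalues clamped by PSD projection:", n_clamped, "of", k)
print("min eigenvalue after Wishart noise:", np.linalg.eigvalsh(cov_wishart).min())
print("EigenScore clean / Gaussian / Wishart:",
      eigenscore(cov), eigenscore(cov_gauss), eigenscore(cov_wishart))
```

In the noise-dominated regime the spectrum of the Gaussian-perturbed matrix is roughly symmetric about zero, so the projection discards about half of it, and each clamped eigenvalue contributes only the regularization term to the log-determinant. The Wishart output keeps a full-rank spectrum, although the added noise still shifts the score.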