Differentially Private Spectral Monitor Logs for Hallucination Detection: A Comparative Study of Wishart and Gaussian Mechanisms

FARS·2026-03-02·Run ID: FA0188

Abstract

Internal state monitoring methods like EigenScore detect LLM hallucinations by analyzing hidden-state covariance matrices, but releasing these spectral logs raises privacy concerns. We present the first comparative study of differential privacy mechanisms for EigenScore-style monitor logs, evaluating Wishart and Gaussian mechanisms on OPT-6.7B with SQuAD v2.0. The Wishart mechanism strictly dominates Gaussian DP, achieving 55.1% AUROC versus 50.7% (+4.4pp) at ε = 1 by avoiding destructive PSD projection that clamps approximately half the eigenvalues to zero. However, we find that EigenScore logs exhibit minimal privacy leakage even without DP protection (0.74% canary-ID accuracy vs 0.50% chance), and DP noise at reasonable privacy budgets (ε ≤ 10) causes unacceptable utility degradation due to fundamental signal-to-noise ratio limitations with K = 10 covariance matrices. Our results establish Wishart as the correct mechanism choice while revealing that the threat model may be weaker than anticipated.
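
The structural difference between the two mechanisms can be illustrated with a minimal sketch. It is not the study's implementation. The function names, the synthetic 10×10 covariance, the embedding width of 64, the Wishart degrees of freedom, and the noise scales are all assumptions chosen for illustration, and the noise is not calibrated to any particular ε. The sketch shows only why symmetric Gaussian noise needs a PSD projection that clamps eigenvalues, while additive Wishart noise is PSD by construction.

```python
import numpy as np


def gaussian_mechanism(cov, sigma, rng):
    """Add symmetric Gaussian noise, then project onto the PSD cone.

    Returns the projected matrix and the number of eigenvalues clamped to zero.
    `sigma` is an illustrative noise scale, not calibrated to a privacy budget.
    """
    k = cov.shape[0]
    noise = rng.normal(scale=sigma, size=(k, k))
    noise = (noise + noise.T) / np.sqrt(2.0)  # symmetrize the noise matrix
    eigvals, eigvecs = np.linalg.eigh(cov + noise)
    n_clamped = int(np.sum(eigvals < 0))
    clipped = np.clip(eigvals, 0.0, None)  # PSD projection: negatives -> 0
    return (eigvecs * clipped) @ eigvecs.T, n_clamped


def wishart_mechanism(cov, scale, dof, rng):
    """Add Wishart-distributed noise W = G G^T, which is PSD by construction.

    `scale` and `dof` are illustrative (hypothetical) parameters.
    """
    k = cov.shape[0]
    g = rng.normal(scale=np.sqrt(scale), size=(k, dof))
    return cov + g @ g.T


rng = np.random.default_rng(0)
K, d = 10, 64  # K matches the abstract; d is an assumed embedding width

# Synthetic stand-in for a K x K hidden-state covariance matrix.
z = rng.normal(scale=0.1, size=(K, d))
cov = z @ z.T / d

noisy_gauss, n_clamped = gaussian_mechanism(cov, sigma=1.0, rng=rng)
noisy_wishart = wishart_mechanism(cov, scale=1.0, dof=K + 1, rng=rng)

print("eigenvalues clamped by PSD projection:", n_clamped, "of", K)
print("min eigenvalue after Wishart noise:", np.linalg.eigvalsh(noisy_wishart).min())
```

When the noise dominates the signal, the spectrum of the Gaussian-perturbed matrix is roughly symmetric around zero, which is consistent with the abstract's observation that about half of the eigenvalues are clamped. The Wishart release needs no projection step.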

Resources