Differentially Private Spectral Monitor Logs for Hallucination Detection: A Comparative Study of Wishart and Gaussian Mechanisms

FARS

FARS·2026-03-02·Run ID: FA0188

Abstract

Internal-state monitoring methods such as EigenScore detect LLM hallucinations by analyzing hidden-state covariance matrices, but releasing the resulting spectral logs raises privacy concerns. We present the first comparative study of differential privacy (DP) mechanisms for EigenScore-style monitor logs, evaluating the Wishart and Gaussian mechanisms on OPT-6.7B with SQuAD v2.0. The Wishart mechanism strictly dominates the Gaussian mechanism, achieving 55.1% AUROC versus 50.7% (+4.4 pp) at $\varepsilon=1$, because it avoids the destructive PSD projection that clamps approximately half of the eigenvalues to zero. However, we find that EigenScore logs exhibit minimal privacy leakage even without DP protection (0.74% canary-ID accuracy vs 0.50% chance), and that DP noise at reasonable privacy budgets ($\varepsilon \leq 10$) causes unacceptable utility degradation due to fundamental signal-to-noise ratio limitations with $K=10$ covariance matrices. Our results establish Wishart as the correct mechanism choice while revealing that the threat model may be weaker than anticipated.
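
The contrast between the two mechanisms can be illustrated with a minimal NumPy sketch. It is not the study's implementation: the synthetic covariance, the function names, the regularizer `alpha`, and the noise parameters `sigma`, `scale`, and `dof` are all assumptions chosen for illustration, and none of them is calibrated to a specific $\varepsilon$. The sketch only shows why symmetric Gaussian noise needs a PSD projection that clamps eigenvalues, while Wishart noise is positive semidefinite by construction.

```python
import numpy as np

def eigenscore(cov, alpha=1e-3):
    """EigenScore-style statistic: mean log-eigenvalue of the regularized covariance.
    The regularizer alpha is an assumed value, not taken from the study."""
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov + alpha * np.eye(k))
    return logdet / k

def gaussian_mechanism(cov, sigma, rng):
    """Add symmetric Gaussian noise, then project onto the PSD cone.
    Returns the projected matrix and the number of eigenvalues clamped to zero."""
    k = cov.shape[0]
    noise = rng.normal(0.0, sigma, size=(k, k))
    noise = np.triu(noise) + np.triu(noise, 1).T   # symmetrize
    vals, vecs = np.linalg.eigh(cov + noise)
    n_clamped = int((vals < 0).sum())
    clamped = np.clip(vals, 0.0, None)             # PSD projection
    return (vecs * clamped) @ vecs.T, n_clamped

def wishart_mechanism(cov, scale, dof, rng):
    """Add Wishart-distributed noise G G^T, which is PSD by construction,
    so no projection (and no eigenvalue clamping) is needed."""
    k = cov.shape[0]
    g = rng.normal(0.0, np.sqrt(scale), size=(k, dof))
    return cov + g @ g.T

# Synthetic stand-in for a K x K covariance over K = 10 sampled generations.
rng = np.random.default_rng(0)
k, d = 10, 64
z = rng.normal(size=(d, k))
zc = z - z.mean(axis=0)
cov = zc.T @ zc / d

noisy_gauss, n_clamped = gaussian_mechanism(cov, sigma=10.0, rng=rng)
noisy_wishart = wishart_mechanism(cov, scale=10.0, dof=k + 1, rng=rng)

print("clean EigenScore:   ", eigenscore(cov))
print("Gaussian EigenScore:", eigenscore(noisy_gauss), "clamped:", n_clamped)
print("Wishart EigenScore: ", eigenscore(noisy_wishart))
```

In this sketch the Gaussian branch reports how many eigenvalues the projection zeroed out, which is the effect the abstract identifies as destructive, whereas the Wishart branch can only shift eigenvalues upward.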
