Suppression-Contrast Tokens: Evaluating Reverse Layer-Contrast for Secret Elicitation
Abstract
Secret elicitation---recovering information that language models encode but refuse to reveal---is important for AI safety auditing. We propose Suppression-Contrast Tokens (SCT), a method built on the hypothesis that secrets are ``present then suppressed'': represented at intermediate layers but actively suppressed by later layers. SCT ranks tokens by their suppression gap (mid-layer minus final-layer log-probability), reversing the contrast direction that DoLa uses to improve factuality. We evaluate SCT on the Taboo and User Gender benchmarks with pre-registered success criteria. A negative control that contrasts in the DoLa direction confirms that the suppression direction carries signal (0.20% TR@5 for the reversed contrast vs. 4.33% for SCT). However, SCT improves only marginally over the logit lens baseline (+23.1% relative, +1.0pp absolute TR@5) and fails 3 of 4 pre-registered criteria. The suppression premise holds in only 9.3% of examples, below our 30% threshold, and SCT does not generalize to binary-attribute secrets. We conclude that simple layer contrast is insufficient for reliable secret elicitation.
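To make the scoring rule concrete, the following is a minimal sketch of the suppression-gap ranking, not the paper's exact evaluation pipeline. It assumes a HuggingFace causal LM (GPT-2 as a stand-in; module names such as `transformer.ln_f` vary by architecture), takes ``logit lens'' to mean applying the final layer norm and unembedding to an intermediate hidden state, and places the mid layer at the midpoint of the stack, which is a hypothetical choice.

```python
# Sketch of suppression-gap ranking under stated assumptions:
# GPT-2 as a stand-in model; "logit lens" = final layer norm + unembedding
# applied to an intermediate hidden state; mid layer = midpoint of the stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The secret word is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

    hidden = out.hidden_states        # (num_layers + 1) tensors, [batch, seq, d]
    mid = len(hidden) // 2            # assumption: "mid layer" = halfway up

    # Logit lens at the mid layer: final norm + unembedding, last position.
    h = model.transformer.ln_f(hidden[mid])   # GPT-2 module names; vary by model
    mid_lp = torch.log_softmax(model.lm_head(h)[0, -1], dim=-1)

    # Final-layer distribution taken straight from the model's own logits.
    final_lp = torch.log_softmax(out.logits[0, -1], dim=-1)

# Suppression gap: mid-layer minus final-layer log-probability.
# (DoLa contrasts in the opposite direction, final minus mid.)
gap = mid_lp - final_lp
values, indices = torch.topk(gap, k=5)
for score, idx in zip(values, indices):
    print(f"{tok.decode([idx.item()])!r}  gap={score.item():.3f}")
```

A fuller implementation would presumably restrict candidates to plausible secret tokens and sweep the mid-layer index; the sketch only illustrates the direction of the contrast.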