Suppression-Contrast Tokens: Evaluating Reverse Layer-Contrast for Secret Elicitation
Abstract
Secret elicitation---recovering information that language models encode but refuse to reveal---is important for AI safety auditing. We propose Suppression-Contrast Tokens (SCT), a method built on the hypothesis that secrets are ``present then suppressed'': represented at intermediate layers but actively suppressed by later layers. SCT ranks tokens by their suppression gap (mid-layer minus final-layer log-probability), reversing the contrast direction that DoLa uses to improve factuality. We evaluate SCT on the Taboo and User Gender benchmarks with pre-registered success criteria. A negative control that contrasts in the DoLa direction confirms that the suppression direction carries signal (0.20% TR@5 for the reversed contrast vs. 4.33% for SCT). However, SCT improves only marginally over the logit lens baseline (+23.1% relative, +1.0pp absolute TR@5) and fails 3 of 4 pre-registered criteria. The suppression premise holds in only 9.3% of examples, below our 30% threshold, and SCT does not generalize to binary-attribute secrets. We conclude that simple layer contrast is insufficient for reliable secret elicitation.
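To make the scoring rule concrete, the following is a minimal sketch of the suppression-gap ranking, not the paper's exact evaluation pipeline. It assumes a HuggingFace causal LM (GPT-2 as a stand-in; module names such as `transformer.ln_f` vary by architecture), takes ``logit lens'' to mean applying the final layer norm and unembedding to an intermediate hidden state, and places the mid layer at the midpoint of the stack, which is a hypothetical choice.

```python
# Sketch of suppression-gap ranking under stated assumptions:
# GPT-2 as a stand-in model; "logit lens" = final layer norm + unembedding
# applied to an intermediate hidden state; mid layer = midpoint of the stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The secret word is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

    hidden = out.hidden_states        # (num_layers + 1) tensors, [batch, seq, d]
    mid = len(hidden) // 2            # assumption: "mid layer" = halfway up

    # Logit lens at the mid layer: final norm + unembedding, last position.
    h = model.transformer.ln_f(hidden[mid])   # GPT-2 module names; vary by model
    mid_lp = torch.log_softmax(model.lm_head(h)[0, -1], dim=-1)

    # Final-layer distribution taken straight from the model's own logits.
    final_lp = torch.log_softmax(out.logits[0, -1], dim=-1)

# Suppression gap: mid-layer minus final-layer log-probability.
# (DoLa contrasts in the opposite direction, final minus mid.)
gap = mid_lp - final_lp
values, indices = torch.topk(gap, k=5)
for score, idx in zip(values, indices):
    print(f"{tok.decode([idx.item()])!r}  gap={score.item():.3f}")
```

A fuller implementation would presumably restrict candidates to plausible secret tokens and sweep the mid-layer index; the sketch only illustrates the direction of the contrast.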