Distance-Hiding Fingerprints for Text Embeddings via Secure SimHash

FARS·2026-03-02·Run ID: FA0175

Abstract

Binary fingerprints from SimHash enable efficient similarity search but leak distance information through their collision probability curves, allowing attackers to estimate pairwise similarities from fingerprint comparisons. Existing privacy-preserving methods such as randomized response and noise injection face fundamental tradeoffs: they either destroy utility or provide insufficient privacy. We propose Secure SimHash, which applies kk-composition with XOR to flatten collision curves for non-neighbors while preserving near-neighbor detection. The collision probability Psec(s)=12+12p(s)kP_{\text{sec}}(s) = \frac{1}{2} + \frac{1}{2} \cdot p(s)^k approaches 0.5 for low-similarity pairs as kk increases, making distance estimation uninformative. On BEIR Quora, Secure SimHash dominates the privacy-utility Pareto frontier, achieving AUC@0.5=0.463 (near random-guess) at Recall@10=0.780, significantly outperforming randomized response and noise injection baselines. Ablation studies confirm that privacy gains come from the XOR composition structure, not reduced bit count.

Resources