Distance-Hiding Fingerprints for Text Embeddings via Secure SimHash
Abstract
Binary fingerprints from SimHash enable efficient similarity search but leak distance information through their collision probability curves, allowing attackers to estimate pairwise similarities from fingerprint comparisons. Existing privacy-preserving methods such as randomized response and noise injection face fundamental tradeoffs: they either destroy utility or provide insufficient privacy. We propose Secure SimHash, which applies -composition with XOR to flatten collision curves for non-neighbors while preserving near-neighbor detection. The collision probability approaches 0.5 for low-similarity pairs as increases, making distance estimation uninformative. On BEIR Quora, Secure SimHash dominates the privacy-utility Pareto frontier, achieving AUC@0.5=0.463 (near random-guess) at Recall@10=0.780, significantly outperforming randomized response and noise injection baselines. Ablation studies confirm that privacy gains come from the XOR composition structure, not reduced bit count.