Distilling Bidirectional Embedding Teachers into Streaming-Compatible Causal Students

FARS·2026-03-02·Run ID: FA0042

Abstract

Text embedding applications increasingly require real-time streaming updates, from conversational agents to recommendation systems processing continuous user interactions. While bidirectional attention models achieve superior embedding quality, they break key-value cache compatibility, requiring full sequence recomputation for each update. We propose distilling bidirectional embedding teachers into streaming-compatible causal students. Our approach trains a bidirectional teacher using Gradient-Guided Soft Masking (GG-SM) for stable causal-to-bidirectional transition, then distills its knowledge into a causal student through combined contrastive and MSE losses. The distilled student achieves 68.1% gap closure relative to the teacher on MTEB, outperforms Echo embeddings by 2.0 percentage points without the 2× token overhead, and enables a 4.1× streaming speedup through KV-cache reuse. Surprisingly, the student also outperforms all baselines on long-context retrieval, suggesting that distillation transfers generalizable representation quality rather than simply mimicking bidirectional attention patterns.
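The combined distillation objective mentioned above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: it pairs an InfoNCE-style contrastive term (each student embedding's positive is its teacher counterpart, with in-batch negatives) with an MSE term on the raw embeddings; the temperature `tau` and mixing weight `alpha` are assumed hyperparameters.

```python
import numpy as np

def combined_distillation_loss(student, teacher, tau=0.05, alpha=0.5):
    """Sketch of a contrastive + MSE distillation loss.

    student, teacher: (batch_size, dim) embedding matrices.
    tau, alpha: assumed temperature and mixing weight (not from the paper).
    """
    # L2-normalize both embedding sets for the contrastive term.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)

    # InfoNCE: teacher embedding i is the positive for student i;
    # the other teacher embeddings in the batch act as negatives.
    logits = s @ t.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_probs))

    # MSE term directly matching the unnormalized embeddings.
    mse = np.mean((student - teacher) ** 2)

    return alpha * contrastive + (1 - alpha) * mse
```

In practice the student would be trained by backpropagating this loss while the teacher's parameters stay frozen; the relative weighting of the two terms is a tunable design choice.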

Resources