ConvergeStop: Inference-Time Convergence-Based Halting for Generative Text Embeddings
Abstract
Generative text embeddings achieve strong retrieval performance through iterative refinement, but incur high computational costs by using a fixed number of iterations for all inputs. We propose ConvergeStop, an inference-time halting rule that monitors embedding stability during generation and stops when convergence is detected. The method computes cosine similarity between consecutive intermediate embeddings and halts when stability exceeds a threshold for multiple consecutive steps. On SciFact, ConvergeStop achieves 55% compute reduction (average 9.10 vs 20 iterations) while matching the quality of full refinement (78.33 vs 77.97 nDCG@10). On FiQA2018, it outperforms compute-matched baselines (+0.53 nDCG@10) despite more modest savings (7%). Our analysis reveals that efficiency gains are dataset-dependent, with larger savings when embeddings converge early. ConvergeStop requires no additional training and operates above the Pareto frontier defined by fixed-iteration baselines.
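The halting rule described above can be sketched as a simple loop: compare each intermediate embedding to the previous one via cosine similarity, and stop once similarity stays above a threshold for several consecutive steps. This is a minimal illustration, not the paper's implementation; the `embed_step` callback, the threshold (0.999), and the patience (3 steps) are assumed values for demonstration, not the tuned settings used in the experiments.

```python
import numpy as np

def converge_stop(embed_step, max_iters=20, threshold=0.999, patience=3):
    """Sketch of a ConvergeStop-style halting loop.

    embed_step(i) -> np.ndarray gives the intermediate embedding after
    refinement step i (hypothetical interface). threshold and patience
    are illustrative assumptions, not the paper's tuned hyperparameters.
    Returns (embedding, iterations_used).
    """
    prev = embed_step(0)
    stable = 0  # count of consecutive steps above the similarity threshold
    for i in range(1, max_iters):
        cur = embed_step(i)
        # cosine similarity between consecutive intermediate embeddings
        cos = np.dot(prev, cur) / (np.linalg.norm(prev) * np.linalg.norm(cur))
        stable = stable + 1 if cos >= threshold else 0
        if stable >= patience:
            return cur, i + 1  # converged: halt early
        prev = cur
    return prev, max_iters  # no early convergence: full refinement
```

Because the rule only inspects embeddings the model already produces, it adds no training cost and negligible inference overhead; inputs whose embeddings stabilize early simply exit the loop sooner, which is where the dataset-dependent savings come from.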