Context Bagging: Inference-Time Ensembling for Robust Long-Context QA Under Hard Distractors

FARS·2026-03-02·Run ID: FA0082

Abstract

Long-context language models struggle with hard distractors: semantically similar but misleading passages that cause systematic errors. Self-consistency, which samples multiple decoding trajectories from the same context, fails because the errors are context-driven: all samples converge on the same wrong answer. We propose Context Bagging (CoBag), an inference-time ensembling method that allocates test-time compute to context diversity rather than decoding diversity. CoBag samples K diverse context subsets using relevance-weighted selection, randomly permutes paragraph order within each subset, generates an answer for each subset via greedy decoding, and aggregates the answers by majority vote. On MuSiQue with hard distractors, CoBag significantly outperforms self-consistency (+3.12 EM, p < 0.001). Surprisingly, ablation reveals that order diversity is the dominant mechanism (+1.44 EM), while subset diversity provides only a marginal additional benefit (+0.40 EM), suggesting that simple order shuffling may suffice for many applications.
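The four steps the abstract describes (relevance-weighted subset sampling, order permutation, greedy decoding, majority voting) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `answer_fn` is a hypothetical placeholder for greedy LLM decoding, and the weighted sampling uses exponential-key sampling without replacement as one plausible realization of "relevance-weighted selection".

```python
import random
from collections import Counter

def cobag(paragraphs, relevance, question, answer_fn, k=8, subset_size=5, seed=0):
    """Sketch of Context Bagging (CoBag) as described in the abstract.

    paragraphs: list of retrieved passages
    relevance:  per-paragraph relevance scores (higher = more relevant)
    answer_fn:  placeholder for greedy decoding with an LLM --
                answer_fn(question, context) -> answer string (hypothetical)
    k:          number of sampled context subsets (test-time compute budget)
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(k):
        # Relevance-weighted sampling without replacement via exponential keys:
        # higher-relevance paragraphs get keys closer to 1 and are kept more often.
        keys = [(rng.random() ** (1.0 / max(w, 1e-9)), i)
                for i, w in enumerate(relevance)]
        subset = [paragraphs[i] for _, i in sorted(keys, reverse=True)[:subset_size]]
        rng.shuffle(subset)                 # order diversity: permute paragraph order
        context = "\n\n".join(subset)
        votes[answer_fn(question, context)] += 1   # one greedy decode per subset
    answer, _ = votes.most_common(1)[0]            # aggregate by majority vote
    return answer
```

Because each subset is decoded greedily, all randomness lives in the context construction, which is the point of the method: diversity comes from what the model reads, not from how it samples tokens.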

Resources