Query-Conditioned Marginals for OT-Based Context Compression: An Empirical Investigation
Abstract
Context compression reduces inference costs for large language models by replacing long inputs with shorter representations. Optimal transport (OT) based methods such as ComprExIT achieve strong results by aggregating context tokens into compression slots via the Sinkhorn algorithm, but they are query-agnostic and may therefore allocate capacity to task-irrelevant tokens. We propose QCap-OT, an inference-time modification that reweights the OT sender marginals by query-anchor similarity to bias compression toward query-relevant content. In our experiments, QCap-OT produces results statistically indistinguishable from vanilla ComprExIT (F1 delta of 0). However, this finding is confounded by a fundamental reproducibility problem: our ComprExIT re-implementation achieves only 2.47% F1, compared with the published 68.08% F1, a gap of roughly 65.6 points that persisted despite extensive debugging. We document this negative result and the reproducibility problem to inform future research on context compression methods.
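To make the idea of reweighting sender marginals concrete, the following is a minimal sketch and not the QCap-OT or ComprExIT implementation. It assumes dot-product similarity between a query vector and each sender vector, a softmax over those similarities, and a hypothetical `mix` parameter that interpolates between a uniform marginal and the query-weighted one. The function names, the squared-Euclidean cost, and the toy vectors are likewise illustrative assumptions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_conditioned_marginal(sender_vecs, query_vec, temperature=1.0, mix=0.5):
    """Blend a uniform sender marginal with a query-similarity distribution.

    ASSUMPTION: dot-product similarity and the `mix` interpolation are
    illustrative choices, not the definition used in the paper.
    """
    sims = [sum(x * q for x, q in zip(vec, query_vec)) for vec in sender_vecs]
    relevance = softmax(sims, temperature)
    n = len(sender_vecs)
    return [(1.0 - mix) / n + mix * r for r in relevance]


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularised OT via Sinkhorn scaling.

    Returns a plan whose row sums approach `a` (sender marginal) and whose
    column sums match `b` (receiver marginal).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy data (hypothetical): four context-token vectors, two compression slots.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]

# Squared Euclidean cost between each token and each slot (an assumption).
cost = [[sum((t - s) ** 2 for t, s in zip(tok, slot)) for slot in slots]
        for tok in tokens]

a = query_conditioned_marginal(tokens, query)  # query-weighted sender marginal
b = [0.5, 0.5]                                 # uniform receiver marginal
plan = sinkhorn(cost, a, b)
```

In this sketch, setting `mix=0.0` recovers a uniform sender marginal, which stands in for the query-agnostic case; larger values shift transported mass toward the tokens most similar to the query.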