Caption Distillation for ReVision-Style Text-Only MLLM Pretraining: An Empirical Study
Abstract
ReVision-style text-only pretraining enables multimodal large language model (MLLM) training without paired images by transforming text embeddings to match image embedding statistics. However, this approach suffers from the Long-Caption Paradox: longer, more detailed captions hurt performance by introducing noise in the embedding space. We hypothesize that CLIP-scored caption distillation---selecting visually relevant sentences based on image-text similarity---could mitigate this paradox. Through controlled experiments comparing long captions, random sentence selection, and CLIP-scored selection, we find that the hypothesis is \textbf{not supported}: caption distillation (51.88% mean accuracy) underperforms long captions (53.31%) by 1.43 percentage points. However, content-aware selection outperforms random selection (49.90%) by 1.98 percentage points, validating that CLIP-based scoring preserves more useful information than uninformed filtering. Analysis reveals that sentence-level filtering inevitably discards object-presence mentions, as evidenced by POPE recall dropping from 98% to 88%. These findings suggest that caption condensation (rewriting to preserve information) may succeed where filtering fails.
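The CLIP-scored selection step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, and the CLIP image/text encoders are stood in for by precomputed embedding vectors. The core idea---rank caption sentences by cosine similarity to the image embedding and keep the top-k in original order---is all the sketch shows.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def distill_caption(sentence_embs, image_emb, k=2):
    """Hypothetical caption-distillation step: score each sentence
    embedding against the image embedding and keep the indices of the
    top-k sentences, preserved in their original caption order.

    sentence_embs : list of 1-D np.ndarray (one CLIP text embedding per sentence)
    image_emb     : 1-D np.ndarray (CLIP image embedding)
    """
    scores = [cosine_sim(s, image_emb) for s in sentence_embs]
    # Rank by score, take top-k, then restore caption order.
    top_k = sorted(sorted(range(len(scores)),
                          key=lambda i: scores[i],
                          reverse=True)[:k])
    return top_k, scores

# Toy example with 2-D stand-in embeddings (real CLIP embeddings are 512-D+).
image_emb = np.array([1.0, 0.0])
sentence_embs = [
    np.array([1.0, 0.0]),   # highly image-relevant sentence
    np.array([0.0, 1.0]),   # irrelevant sentence
    np.array([0.6, 0.8]),   # partially relevant sentence
]
kept, scores = distill_caption(sentence_embs, image_emb, k=2)
print(kept)  # → [0, 2]: the two sentences most similar to the image
```

Note that, as the abstract's POPE analysis indicates, such hard sentence-level filtering can drop object-presence mentions entirely, which motivates condensation (rewriting) rather than filtering.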