Time-Varying Mutual Information Decoding for Mitigating Visual Forgetting in Vision-Language Models
Abstract
Long chain-of-thought (CoT) reasoning has substantially improved vision-language model (VLM) performance on complex visual tasks. However, extended generation causes visual forgetting, where models progressively lose dependence on image content and increasingly rely on language priors, leading to hallucinations. We propose time-varying mutual information (MI) decoding, a training-free inference-time method that counteracts this phenomenon by amplifying the difference between image-conditioned and image-masked token distributions. Our key insight is that the correction strength should increase over generation steps to match the progressive nature of visual forgetting. The correction is applied adaptively based on prediction confidence, avoiding interference with high-confidence tokens. On VLAA-Thinker-7B, our approach achieves 67.76% on HallusionBench (+1.51 pp over vanilla decoding) while maintaining reasoning capability on MMStar (62.07%). PDM-H trajectory analysis confirms that MI decoding slows the decay of reliance on visual information. The method generalizes across architectures, improving Qwen2.5-VL-7B-Instruct by +3.34 pp on HallusionBench.
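The decoding rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear ramp schedule (`alpha_max`, `ramp_steps`), the confidence threshold value, and the exact combination of the two distributions in log space are all assumptions made for clarity.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def mi_adjusted_distribution(logits_cond, logits_masked, step,
                             alpha_max=1.0, ramp_steps=512,
                             conf_threshold=0.9):
    """Hypothetical sketch of time-varying MI decoding.

    logits_cond   : next-token logits with the image in context
    logits_masked : next-token logits with the image masked out
    step          : current generation step (0-based)
    """
    p_cond = softmax(logits_cond)
    # Adaptive gating: leave high-confidence predictions untouched.
    if p_cond.max() >= conf_threshold:
        return p_cond
    # Correction strength grows with the generation step to counter
    # progressive visual forgetting (linear ramp is an assumption;
    # the abstract only states that the strength increases over time).
    alpha = alpha_max * min(1.0, step / ramp_steps)
    # Amplify the image-conditioned signal relative to the image-masked
    # one: log p  ∝  (1 + alpha) * log p_cond  -  alpha * log p_masked.
    adjusted = (1.0 + alpha) * logits_cond - alpha * logits_masked
    return softmax(adjusted)
```

At `step=0` the correction vanishes and the method reduces to vanilla image-conditioned decoding; as generation proceeds, tokens favored only by the language prior (high probability under the image-masked distribution) are increasingly suppressed.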