Deep-Layer Attention Pruning for Vision-Language Models
Abstract
Visual token pruning is essential for efficient vision-language model inference, yet existing attention-based methods either fail catastrophically on spatially sensitive tasks or require offline calibration data. We present a simple solution: use attention from deeper layers. While prior methods such as DPruner extract attention from shallow layers (layer 2) and apply offline debiasing, we show that attention at layer 12 of InternVL2.5-8B is semantically rich enough to guide token selection directly, without any debiasing. Diagnostic analysis reveals that shallow-layer attention lacks the positional bias assumed by debiasing approaches (as measured by Spearman correlation), explaining why ratio-based normalization degrades rather than improves performance. Our deep-layer attention pruning achieves 66.32% grounding accuracy on the RefCOCO benchmarks, surpassing DPruner by +11.29 points while retaining 92% of no-pruning performance, all without offline calibration.
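The core mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the NumPy-array interface, and the convention that visual tokens occupy the first positions of the sequence are all assumptions made for the sketch. It shows the basic idea of ranking visual tokens by the attention they receive from text-token queries at a single deep layer, then keeping the top fraction.

```python
import numpy as np

def prune_visual_tokens(attn, num_visual, keep_ratio=0.5):
    """Rank visual tokens by attention received from text tokens at one
    deep layer (e.g. layer 12) and keep the top fraction.

    attn: (num_heads, seq_len, seq_len) attention weights from a single
          layer; rows are queries, columns are keys.
    Assumes visual tokens occupy positions [0, num_visual).
    """
    # Average over heads, then over text-token queries, to score each
    # visual token by the attention it receives.
    head_avg = attn.mean(axis=0)                              # (seq, seq)
    scores = head_avg[num_visual:, :num_visual].mean(axis=0)  # (num_visual,)
    # Keep the top-k visual tokens, preserving their original order.
    k = max(1, int(keep_ratio * num_visual))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep

# Toy example: 6 visual tokens followed by 4 text tokens, 2 heads.
rng = np.random.default_rng(0)
attn = rng.random((2, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalize like a softmax
kept = prune_visual_tokens(attn, num_visual=6, keep_ratio=0.5)
print(kept)  # indices of the 3 retained visual tokens
```

Note that, unlike DPruner's pipeline, no offline debiasing or ratio-based normalization step is applied to the scores; the raw deep-layer attention is used directly.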