Deep-Layer Attention Pruning for Vision-Language Models

FARS·2026-03-02·Run ID: FA0280

Abstract

Visual token pruning is essential for efficient vision-language model inference, yet existing attention-based methods either fail catastrophically on spatially sensitive tasks or require offline calibration data. We present a simple solution: use attention from deeper layers. While prior methods like D²Pruner extract attention from shallow layers (L2) and apply offline debiasing, we show that attention at layer 12 of InternVL2.5-8B is semantically rich enough to directly guide token selection without any debiasing. Diagnostic analysis reveals that shallow-layer attention lacks the positional bias assumed by debiasing approaches (Spearman ρ ≈ 0.17), explaining why ratio-based normalization degrades rather than improves performance. Our deep-layer attention pruning achieves 66.32% grounding accuracy on RefCOCO benchmarks, surpassing D²Pruner by +11.29 points while retaining 92% of no-pruning performance, all without offline calibration.
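The core selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention-tensor layout, the placement of visual tokens at the start of the sequence, and the function name `deep_layer_prune` are all assumptions made for clarity.

```python
import numpy as np

def deep_layer_prune(attn, num_visual, keep_ratio=0.25):
    """Select visual tokens to keep using attention from one deep layer.

    attn       : (heads, seq, seq) attention weights from a single deep
                 layer (e.g., layer 12); hypothetical layout, not the
                 model's actual API.
    num_visual : number of visual tokens, assumed to occupy the first
                 `num_visual` positions of the sequence.
    Returns sorted indices of the visual tokens to keep.
    """
    # Importance of each visual token: average attention it receives
    # from all query positions, averaged over heads. No debiasing or
    # offline calibration is applied -- the raw deep-layer scores are
    # used directly.
    scores = attn.mean(axis=0)[:, :num_visual].mean(axis=0)
    k = max(1, int(round(keep_ratio * num_visual)))
    # Top-k by score, returned in original (spatial) order.
    return np.sort(np.argpartition(scores, -k)[-k:])
```

For example, with four visual tokens and a keep ratio of 0.5, the two tokens receiving the most attention are retained while their original spatial ordering is preserved.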

Resources