Tuned-Lens-Style Affine Alignment for Encoder Truncation in Whisper ASR: An Empirical Investigation
Abstract
Whisper is a powerful encoder-decoder transformer for automatic speech recognition, but its encoder accounts for 68% of inference latency in batched settings, making encoder truncation an attractive speedup target. The tuned lens has shown that affine transformations can align intermediate representations with the final layer in autoregressive language models, suggesting a lightweight way to enable encoder truncation. We systematically investigate whether tuned-lens-style alignment can make Whisper encoder truncation practical. We train affine and MLP translators to map truncated encoder states to the expected final-layer representations across multiple depths. Our experiments reveal a fundamental depth-speedup tradeoff: truncation depths that yield meaningful speedup (1.2×) produce catastrophic word error rates (100%), while the shallowest depth with a non-catastrophic WER (18.90%) provides no speedup. This negative result demonstrates that the tuned-lens analogy does not transfer to encoder-decoder ASR: cross-attention imposes stricter alignment requirements than vocabulary prediction in decoder-only models.
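The translators described above are, at their core, per-depth affine maps applied to each truncated encoder hidden state. The following is a minimal illustrative sketch of that idea; the function name, toy dimensions, and parameter values are hypothetical and do not reflect Whisper's actual model sizes or the trained weights from our experiments.

```python
# Hypothetical sketch: a tuned-lens-style affine translator maps a truncated
# encoder hidden state h toward the final-layer state the decoder's
# cross-attention expects: translate(h) = W @ h + b.

def affine_translate(h, W, b):
    """Apply an affine translator (matrix W, bias b) to one hidden-state
    vector h, returning the translated vector."""
    return [sum(W[i][j] * h[j] for j in range(len(h))) + b[i]
            for i in range(len(b))]

# Toy 2-dimensional example: identity-initialised weight plus a learned shift.
W = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.5, -0.5]
h_truncated = [2.0, 3.0]
print(affine_translate(h_truncated, W, b))  # -> [2.5, 2.5]
```

In practice one such translator would be trained per truncation depth (e.g. by regressing truncated states onto final-layer states), and the MLP variant simply replaces the single affine map with a small nonlinear network.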