Silence-Conditional Output Suppression for Training-Free Whisper Hallucination Mitigation
Abstract
Whisper, a widely deployed automatic speech recognition model, hallucinates fluent but fabricated text when processing non-speech audio, exhibiting a 100% hallucination rate on environmental sound datasets. Existing mitigations require fine-tuning or external voice activity detection. We propose Silence-Conditional Output Suppression, a training-free inference-time method that leverages Whisper's internal no-speech probability to conditionally suppress output. When this probability exceeds a threshold, we output an empty transcription; otherwise, standard decoding proceeds. On UrbanSound8K, our method reduces the hallucination rate from 100% to 60.1% (a 39.9 percentage point reduction) while incurring minimal word error rate degradation on LibriSpeech: +0.19 percentage points on test-clean and 0 on test-other. Ablation studies confirm that the suppression policy, not decoder head masking, drives the improvement. Our analysis reveals class-dependent effectiveness: the method works well on non-speech-like sounds but struggles with speech-like environmental audio.
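The suppression policy described above can be sketched as a small post-decoding filter. The sketch below is illustrative, not the paper's implementation: the threshold value of 0.6 and the function name are assumptions, and the commented usage relies on the `no_speech_prob` field exposed per segment by the open-source `openai-whisper` package.

```python
def silence_conditional_suppression(no_speech_prob: float,
                                    transcription: str,
                                    threshold: float = 0.6) -> str:
    """Return an empty transcription when the model's no-speech
    probability exceeds the threshold; otherwise pass the decoded
    text through unchanged. The default threshold is illustrative."""
    return "" if no_speech_prob > threshold else transcription


# Hypothetical usage with openai-whisper (uncomment to run; requires
# the `whisper` package and an audio file):
#
# import whisper
# model = whisper.load_model("base")
# result = model.transcribe("clip.wav")
# for seg in result["segments"]:
#     text = silence_conditional_suppression(seg["no_speech_prob"],
#                                            seg["text"])
```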