Premature Speech EOS is Not a Dominant Failure Mode in Qwen2.5-Omni: An Empirical Study of Text-Length-Coupled Audio Stopping
Abstract
End-to-end omni-modal large language models enable seamless speech interaction but face challenges in maintaining speech-text consistency: generated speech may be truncated before it conveys the complete text content. Prior work suggests that premature end-of-sequence (EOS) token emission is a key failure mode in long-form speech generation. We propose Text-Length-Coupled Audio Stopping (TLC-AS), a training-free, decode-time intervention that couples the speech stopping decision to the generated text length by computing a minimum audio token floor from a words-per-second calibration. However, our empirical study on Qwen2.5-Omni with VoiceBench CommonEval (200 samples) yields a negative result: premature EOS is rare (only 0.5% of samples exhibit early stopping under a raised audio token cap), and TLC-AS in fact increases the word error rate (WER) from 5.86% to 9.05%. The model's Thinker-Talker architecture already achieves good speech-text alignment without decode-time intervention. This finding underscores the importance of verifying that a target failure mode actually exists before designing solutions to address it.
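The abstract describes TLC-AS as deriving a minimum audio token floor from the generated text length via a words-per-second calibration. A minimal sketch of that idea is shown below; the function names, the default speaking rate (2.5 words/s), the codec token rate (25 tokens/s), and the safety margin are all illustrative assumptions, not values from the paper.

```python
def min_audio_tokens(text: str,
                     words_per_second: float = 2.5,
                     tokens_per_second: float = 25.0,
                     margin: float = 0.9) -> int:
    """Hypothetical floor on audio tokens before EOS may be emitted.

    words_per_second: assumed calibrated speaking rate.
    tokens_per_second: assumed codec token rate of the talker.
    margin: safety factor keeping the floor conservative.
    """
    n_words = len(text.split())
    est_duration_s = n_words / words_per_second
    return int(est_duration_s * tokens_per_second * margin)


def should_suppress_eos(tokens_emitted: int, text: str) -> bool:
    # Mask out the audio EOS token until the estimated floor is reached.
    return tokens_emitted < min_audio_tokens(text)
```

During decoding, a wrapper of this kind would zero out the EOS logit whenever `should_suppress_eos` returns True, forcing the talker to keep generating audio tokens until the text-length-derived floor is met.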