Step-Down Bridge Guidance Scheduling for Dual-CFG in Video-Audio Diffusion
Abstract
Joint video-audio diffusion models such as MOVA employ dual classifier-free guidance (dual-CFG), with separate bridge guidance for video-audio alignment and text guidance for content control. However, holding bridge guidance constant throughout denoising may hurt speech fidelity by crowding out text-condition sensitivity in late steps. We propose Step-Down bridge guidance scheduling, a training-free technique that maintains high bridge guidance in early denoising steps for structural alignment, then reduces it via a cosine ramp to a lower value in late steps to restore text-condition sensitivity. Our approach is motivated by a norm analysis showing that the bridge-to-text guidance ratio increases from 1.04 to 1.47 across denoising steps. On Verse-Bench speech prompts, Step-Down scheduling achieves a 1.5% WER improvement over the constant baseline while preserving synchronization quality (AV-A within 1.2%). Crucially, timing matters: Step-Down outperforms Step-Up (same values, reversed order) by 1.9%, demonstrating that the temporal allocation of guidance strength determines speech fidelity.
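As an illustration of the schedule described in the abstract, the following is a minimal sketch of a step-down bridge guidance scale with a cosine ramp. It is not the authors' implementation. The function name `step_down_bridge_scale`, the ramp window (`ramp_start`, `ramp_end`), and the example scales 4.0 and 2.0 are all hypothetical placeholders; the abstract as given here does not state the actual high and low values or where the ramp begins and ends.

```python
import math


def step_down_bridge_scale(step, num_steps, s_high, s_low,
                           ramp_start=0.5, ramp_end=0.8):
    """Return the bridge guidance scale for one denoising step.

    step 0 is the first (noisiest) step. ramp_start and ramp_end are
    fractions of the denoising trajectory and are assumptions made for
    this sketch, not values taken from the source.
    """
    if num_steps < 2:
        return s_high
    progress = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    if progress <= ramp_start:
        return s_high                  # early steps: hold the high scale
    if progress >= ramp_end:
        return s_low                   # late steps: hold the low scale
    # Cosine ramp: u runs 0 -> 1 across the window, the weight runs 1 -> 0.
    u = (progress - ramp_start) / (ramp_end - ramp_start)
    weight = 0.5 * (1.0 + math.cos(math.pi * u))
    return s_low + (s_high - s_low) * weight


# Hypothetical scales, chosen only to make the example concrete.
schedule = [step_down_bridge_scale(i, 21, s_high=4.0, s_low=2.0)
            for i in range(21)]
```

A Step-Up control of the kind mentioned in the abstract (same values, reversed order) could be obtained in this sketch by swapping `s_high` and `s_low`. Keeping the schedule a pure function of the step index means it can be evaluated inside an existing sampling loop without retraining, which matches the training-free framing.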