ScaffoldSwap: Are Discrete Speech Units Necessary as a Temporal Scaffold for Audio-Driven 3D Facial Animation?
Abstract
Audio-driven 3D facial animation increasingly uses discrete speech units as temporal scaffolds, yet it remains unclear whether discretization is uniquely beneficial or whether simpler phoneme+timing scaffolds suffice. We present ScaffoldSwap, a controlled ablation study comparing three speech conditioning approaches---continuous SSL features (WavLM), discrete speech units (HuBERT + k-means), and phoneme+timing (forced alignment)---with an identical decoder architecture. Experiments on BIWI and VOCASET show that discrete units achieve 10.2% and 5.9% lower Lip Vertex Error than continuous SSL features, while phoneme+timing achieves 9.0% and 5.5% reductions. Discrete units consistently outperform phoneme+timing by 0.5--1.3%, rejecting the hypothesis that any temporal scaffold performs equivalently. Ablations show that k-means quantization provides a 17.5% improvement over continuous HuBERT features, demonstrating that discretization itself---not the underlying representation---drives the gains. Explicit timing features contribute negligibly (+0.01%), indicating that frame-level phoneme identity alone captures sufficient temporal structure.
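The discrete-unit conditioning pathway compared above can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for real HuBERT frame features, the codebook size and feature dimensions are toy values, and a plain Lloyd's k-means replaces the large-scale clustering typically run over a full corpus.

```python
import numpy as np

def kmeans_quantize(features, k=16, iters=10, seed=0):
    """Quantize continuous frame features into discrete unit IDs.

    Minimal Lloyd's k-means: fit k centroids, then replace each
    frame with the index of its nearest centroid (its 'unit').
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        # Squared distances from every frame to every centroid.
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        units = d.argmin(axis=1)
        # Update each centroid to the mean of its assigned frames.
        for j in range(k):
            mask = units == j
            if mask.any():
                centroids[j] = features[mask].mean(axis=0)
    return units

# Placeholder for T frames of D-dim HuBERT features (hypothetical shapes).
rng = np.random.default_rng(1)
T, D = 200, 64
feats = rng.normal(size=(T, D))
units = kmeans_quantize(feats, k=16)  # one discrete unit ID per frame
```

The resulting `units` sequence, one integer per audio frame, is what a decoder would consume in place of the continuous feature stream; the paper's ablation attributes the gains to exactly this quantization step.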