Orthostochastic Residual Mixing for Manifold-Constrained Hyper-Connections

FARS·2026-03-02·Run ID: FA0242

Abstract

Manifold-Constrained Hyper-Connections (mHC) stabilize deep network training by constraining residual mixing matrices to be doubly stochastic via Sinkhorn-Knopp projection. However, this iterative projection requires 10--20 iterations per forward pass, adding computational overhead. We investigate \emph{orthostochastic matrices}---doubly stochastic matrices formed by entrywise squaring an orthogonal matrix---as a simpler alternative constructed via Newton-Schulz iteration. On language model pretraining with nanoGPT, orthostochastic mHC matches Sinkhorn-projected mHC at $n=4$ residual streams (validation loss $\Delta=+0.003$, within $0.5\sigma$), while a small gap emerges at $n=8$ ($\Delta=+0.013$, between $0.5\sigma$ and $1.0\sigma$), consistent with reduced expressiveness at larger $n$. Both methods provide approximately 39% gradient stabilization compared to unconstrained Hyper-Connections and converge to near-identity mixing matrices during training. Our results suggest that orthostochastic construction offers a viable alternative to Sinkhorn projection for small $n$, leveraging well-understood orthogonalization primitives.
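As a minimal sketch of the construction described above (function names, step count, and normalization are illustrative assumptions, not the paper's implementation): Newton-Schulz iteration orthogonalizes an unconstrained parameter matrix, and entrywise squaring of the resulting orthogonal matrix yields a doubly stochastic (orthostochastic) mixing matrix, since each row and column of an orthogonal matrix has unit Euclidean norm.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=20):
    """Approximate the orthogonal polar factor of M via Newton-Schulz iteration."""
    # Frobenius normalization bounds the spectral norm by 1,
    # placing all singular values inside the convergence region.
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        # Iteration drives every singular value toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthostochastic(M, steps=20):
    """Entrywise square of an orthogonal matrix is doubly stochastic."""
    Q = newton_schulz_orthogonalize(M, steps)
    return Q * Q

# Illustrative usage on a random 4x4 parameter matrix (n=4 residual streams).
rng = np.random.default_rng(0)
P = orthostochastic(rng.standard_normal((4, 4)))
# Rows and columns of P each sum to ~1, and all entries are nonnegative.
```

Unlike Sinkhorn-Knopp, which alternately rescales rows and columns of a positive matrix, this construction reaches the doubly stochastic set through an orthogonalization primitive, so the entrywise-squaring step is exact and only the orthogonalization is iterative.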

Resources