Orthostochastic Residual Mixing for Manifold-Constrained Hyper-Connections
Abstract
Manifold-Constrained Hyper-Connections (mHC) stabilize deep network training by constraining residual mixing matrices to be doubly stochastic via Sinkhorn-Knopp projection. However, this iterative projection requires 10--20 iterations per forward pass, adding computational overhead. We investigate \emph{orthostochastic matrices}---doubly stochastic matrices formed by entrywise squaring an orthogonal matrix---as a simpler alternative constructed via Newton-Schulz iteration. On language model pretraining with nanoGPT, orthostochastic mHC matches Sinkhorn-projected mHC at small numbers of residual streams $n$, with validation losses agreeing closely, while a small gap emerges at larger $n$, consistent with reduced expressiveness as $n$ grows. Both methods provide approximately 39% gradient stabilization compared to unconstrained Hyper-Connections, and both converge to near-identity mixing matrices during training. Our results suggest that orthostochastic construction offers a viable alternative to Sinkhorn projection for small $n$, leveraging well-understood orthogonalization primitives.
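To make the two constructions concrete, the following is a minimal PyTorch sketch; the function names, step counts, and scaling details are illustrative assumptions, not the paper's implementation. Squaring the entries of a Newton-Schulz-orthogonalized matrix yields a doubly stochastic matrix directly, whereas Sinkhorn-Knopp reaches one by alternating row and column normalization.

\begin{verbatim}
import torch

def newton_schulz_orthogonalize(M, steps=10):
    # Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X; converges to
    # the nearest orthogonal matrix when the spectral norm of X is < sqrt(3).
    X = M / (M.norm() + 1e-7)  # Frobenius scaling bounds the spectral norm by 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthostochastic(M, steps=10):
    # Entrywise square of an (approximately) orthogonal matrix: every row and
    # column of Q has unit Euclidean norm, so the squared entries sum to 1
    # along both axes, giving a doubly stochastic matrix.
    Q = newton_schulz_orthogonalize(M, steps)
    return Q * Q

def sinkhorn_knopp(logits, iters=15):
    # Alternate row and column normalization of a positive matrix; this is
    # the 10--20-iteration doubly stochastic projection described for mHC.
    A = logits.exp()
    for _ in range(iters):
        A = A / A.sum(dim=1, keepdim=True)  # normalize rows
        A = A / A.sum(dim=0, keepdim=True)  # normalize columns
    return A
\end{verbatim}

Both routines return matrices whose rows and columns sum to one up to iteration error; the orthostochastic path uses only a fixed number of matrix multiplies, which is what lets it reuse standard orthogonalization primitives in place of a data-dependent normalization loop.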