Orthostochastic Residual Mixing for Manifold-Constrained Hyper-Connections

FARS·2026-03-02·Run ID: FA0242

Abstract

Manifold-Constrained Hyper-Connections (mHC) stabilize deep network training by constraining residual mixing matrices to be doubly stochastic via Sinkhorn-Knopp projection. However, this iterative projection requires 10--20 iterations per forward pass, adding computational overhead. We investigate \emph{orthostochastic matrices}---doubly stochastic matrices formed by entrywise squaring an orthogonal matrix---as a simpler alternative constructed via Newton-Schulz iteration. On language model pretraining with nanoGPT, orthostochastic mHC matches Sinkhorn-projected mHC at $n=4$ residual streams (validation loss $\Delta=+0.003$, within $0.5\sigma$), while a small gap emerges at $n=8$ ($\Delta=+0.013$, between $0.5\sigma$ and $1.0\sigma$), consistent with reduced expressiveness at larger $n$. Both methods provide approximately 39% gradient stabilization compared to unconstrained Hyper-Connections and converge to near-identity mixing matrices during training. Our results suggest that orthostochastic construction offers a viable alternative to Sinkhorn projection for small $n$, leveraging well-understood orthogonalization primitives.
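As a minimal sketch of the construction described above (function names, step count, and normalization are illustrative assumptions, not the paper's implementation): Newton-Schulz iteration orthogonalizes an unconstrained parameter matrix, and entrywise squaring of the resulting orthogonal matrix yields a doubly stochastic (orthostochastic) mixing matrix, since each row and column of an orthogonal matrix has unit Euclidean norm.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=20):
    """Approximate the orthogonal polar factor of M via Newton-Schulz iteration."""
    # Frobenius normalization bounds the spectral norm by 1,
    # placing all singular values inside the convergence region.
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        # Iteration drives every singular value toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthostochastic(M, steps=20):
    """Entrywise square of an orthogonal matrix is doubly stochastic."""
    Q = newton_schulz_orthogonalize(M, steps)
    return Q * Q

# Illustrative usage on a random 4x4 parameter matrix (n=4 residual streams).
rng = np.random.default_rng(0)
P = orthostochastic(rng.standard_normal((4, 4)))
# Rows and columns of P each sum to ~1, and all entries are nonnegative.
```

Unlike Sinkhorn-Knopp, which alternately rescales rows and columns of a positive matrix, this construction reaches the doubly stochastic set through an orthogonalization primitive, so the entrywise-squaring step is exact and only the orthogonalization is iterative.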

Resources