HC expands the residual stream by a factor of n in the width direction, so that instead of a single residual line between layers there are n parallel residual streams. At each layer it introduces three small learnable mixing operations (sketched in code below):

- H^{pre}: aggregates the n streams into a single input of the original width C for the block F(·)
- H^{post}: scatters the block output of width C back across the n streams
- H^{res}: mixes the n streams with each other along the residual path using an n×n matrix

This lowers pretraining loss and, as a result, improves downstream benchmark performance (such as MMLU, GSM8K).
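To make the three operations concrete, here is a minimal PyTorch sketch of a single static HC layer. The class name HyperConnection, the (n_streams, batch, seq, C) shape convention, and the initialization are my own illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """One HC layer: n parallel residual streams mixed around a block F(·).

    Sketch only, based on my reading of the description above.
    """
    def __init__(self, n_streams: int, block: nn.Module):
        super().__init__()
        self.block = block                                                    # F(·), e.g. attention or MLP
        self.h_pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # H^{pre}: aggregate n streams -> 1 input
        self.h_post = nn.Parameter(torch.ones(n_streams))                     # H^{post}: scatter 1 output -> n streams
        self.h_res = nn.Parameter(torch.eye(n_streams))                       # H^{res}: n×n stream-to-stream mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_streams, batch, seq, C)
        block_in = torch.einsum("n,nbsc->bsc", self.h_pre, x)            # aggregate to the original width C
        block_out = self.block(block_in)                                  # run the block at width C
        scattered = torch.einsum("n,bsc->nbsc", self.h_post, block_out)  # scatter the output back to n streams
        mixed = torch.einsum("mn,nbsc->mbsc", self.h_res, x)             # mix the residual streams with each other
        return mixed + scattered
```

At the network boundary, the single residual stream is typically replicated into n copies at the input and reduced (e.g. summed) back to one at the output; the HC paper also describes a dynamic variant where the mixing weights are computed from the input rather than being static parameters.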
Limitation
HC breaks the identity mapping of the residual path. The accumulated product ∏ H^{res} across layers can easily cause signal explosion/vanishing → training instability at large scale. The paper shows that at 27B, loss/grad-norm spikes occur and "the gain of the composite mapping spikes up to around 3000."
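A toy numpy check (not from the paper) of why the unconstrained product is dangerous: multiply many near-identity mixing matrices and watch the gain of the composite mapping drift away from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 60                                             # 4 residual streams, 60 layers (toy numbers)

gain = np.eye(n)
for _ in range(depth):
    h_res = np.eye(n) + 0.2 * rng.standard_normal((n, n))    # unconstrained H^{res}, slightly off identity
    gain = h_res @ gain                                      # composite mapping ∏ H^{res}

# Largest singular value of the composite mapping: for generic unconstrained
# matrices it grows (or shrinks) exponentially with depth instead of staying near 1.
print(np.linalg.norm(gain, 2))
```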
mHC
mHC cuts the transformer's composite-mapping gain explosion from ~3000 → ~1.6

mHC applies a manifold constraint to the residual matrices H^{res}, making them doubly stochastic (points in the Birkhoff polytope), which preserves the signal mean and keeps the norm bounded. Sinkhorn–Knopp is used for the projection. This eliminates signal explosion/vanishing and ensures stability across depth. At large scale (27B etc.), it achieves stable training + a performance improvement. Through kernel fusion, recomputation, and pipeline communication overlap, the additional overhead is kept to ~6.7%.
So in mHC, projecting onto the Birkhoff polytope is the same thing as making the matrix doubly stochastic. Training proceeds freely → at each step, the resulting residual matrix is projected onto the Birkhoff polytope using Sinkhorn–Knopp.
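A minimal Sinkhorn–Knopp sketch, assuming the standard recipe (exponentiate to get positive entries, then alternately normalize rows and columns); the function name and iteration count are arbitrary choices, not the paper's.

```python
import numpy as np

def sinkhorn_knopp(m: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Map an unconstrained square matrix to a (near) doubly stochastic one."""
    a = np.exp(m)                                  # strictly positive entries
    for _ in range(n_iters):
        a = a / a.sum(axis=1, keepdims=True)       # normalize rows to sum to 1
        a = a / a.sum(axis=0, keepdims=True)       # normalize columns to sum to 1
    return a                                       # converges to a point in the Birkhoff polytope

h_res = np.random.default_rng(0).standard_normal((4, 4))
ds = sinkhorn_knopp(h_res)
print(ds.sum(axis=0), ds.sum(axis=1))              # both close to [1, 1, 1, 1]
```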
If we view the residual stream as "mixing," the most natural form of "safe mixing" is a doubly stochastic (DS) matrix. The Birkhoff polytope is the convex hull of permutation matrices → "soft permutation / soft routing." This is the only type of constraint that preserves meaning even when accumulated (multiplied) across depth, since the product of doubly stochastic matrices is itself doubly stochastic.
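A quick sanity check of the "safe when accumulated" claim (toy code, not from the paper): sample points in the Birkhoff polytope as convex combinations of permutation matrices, multiply many of them, and confirm the composite stays doubly stochastic with gain exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 60

def random_doubly_stochastic(n: int) -> np.ndarray:
    """Random point in the Birkhoff polytope: convex combination of permutation matrices."""
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(8)]
    weights = rng.dirichlet(np.ones(len(perms)))
    return sum(w * p for w, p in zip(weights, perms))

gain = np.eye(n)
for _ in range(depth):
    gain = random_doubly_stochastic(n) @ gain      # composite ∏ H^{res} under the DS constraint

print(gain.sum(axis=0), gain.sum(axis=1))          # rows and columns still sum to 1
print(np.linalg.norm(gain, 2))                     # spectral norm stays 1: no explosion across depth
```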

Seonglae Cho