Hyper Connection

Creator
Seonglae Cho
Created
2025 Dec 29 17:14
Edited
2026 Jan 6 18:7
HC expands the residual stream by a factor of n in the width direction, so that instead of a "single-line residual" between layers, there are n parallel residual streams flowing. At each layer, it introduces three small learnable mixing operations:
  • H^{pre}: aggregates the n streams into the original block input dimension C for the block F(·)
  • H^{post}: scatters the block output (dimension C) back into the n streams
  • H^{res}: mixes the streams with each other inside the residual stream using an n×n matrix
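The three operations above can be sketched as a single layer update. This is a minimal illustration, assuming the update has the form x_{l+1} = H^{res} x_l + (H^{post})^T F(H^{pre} x_l); the function name `hc_layer`, the argument names, and the toy block are hypothetical and use fixed weights, whereas the real mappings are learned.

```python
def hc_layer(streams, h_pre, h_post, h_res, block):
    """One hyper-connection step over n parallel residual streams.

    streams: n lists of length C (the widened residual stream)
    h_pre:   n weights that aggregate the streams into one block input
    h_post:  n weights that scatter the block output back to the streams
    h_res:   n x n matrix that mixes the streams with each other
    block:   the layer function F, mapping a length-C list to a length-C list
    """
    n, c = len(streams), len(streams[0])
    # Aggregate: weighted sum of the n streams gives a single C-dim input.
    block_in = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(c)]
    block_out = block(block_in)
    # Mix: each output stream is a weighted combination of all input streams.
    mixed = [[sum(h_res[i][j] * streams[j][k] for j in range(n)) for k in range(c)]
             for i in range(n)]
    # Scatter: add the block output to every stream, scaled by h_post.
    return [[mixed[i][k] + h_post[i] * block_out[k] for k in range(c)]
            for i in range(n)]


# Toy usage: n = 2 streams, C = 3, and a block that doubles its input.
streams = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = hc_layer(streams, [0.5, 0.5], [1.0, 1.0], identity, lambda v: [2.0 * a for a in v])
```

With an identity mixing matrix and these weights, each stream reduces to the ordinary residual update x + F(x), which is the special case HC generalizes.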
This lowers pretraining loss and, as a result, improves downstream benchmark performance (such as MMLU, GSM8K).

Limitation

HC breaks the identity mapping that keeps plain residual connections stable. Across layers the residual mixing matrices multiply, and the product ∏H^{res} can easily make the signal explode or vanish, which causes training instability at large scale. The paper reports loss and gradient-norm spikes at 27B, where "the gain of the composite mapping spikes up to around 3000."
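A toy experiment makes the instability concrete. The sketch below multiplies unconstrained n×n mixing matrices across depth and reports the largest absolute row sum of the product as a rough gain measure. The perturbation scale, the seed, and this choice of gain measure are assumptions for illustration, not the paper's setup; the perturbations are nonnegative so the product grows, while signed perturbations could just as well make it shrink.

```python
import random


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_row_sum(m):
    """Largest absolute row sum, used here as a simple gain measure."""
    return max(sum(abs(v) for v in row) for row in m)


def composite_gain(depth, n=4, noise=0.3, seed=0):
    """Gain of the product of `depth` unconstrained mixing matrices."""
    rng = random.Random(seed)
    prod = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(depth):
        # Identity plus a small nonnegative perturbation: no constraint
        # ties the row or column sums to 1.
        h = [[(1.0 if i == j else 0.0) + rng.uniform(0.0, noise)
              for j in range(n)] for i in range(n)]
        prod = matmul(h, prod)
    return max_abs_row_sum(prod)


for depth in (0, 5, 30):
    print(depth, composite_gain(depth))
```

Each individual matrix is close to the identity, yet the gain of the product compounds with depth, which is the failure mode described above.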

mHC

mHC cuts the peak gain of the composite residual mapping from about 3000 to about 1.6
mHC applies a manifold constraint to the residual matrices, restricting them to doubly stochastic matrices (the Birkhoff polytope), which preserves the signal mean and keeps the signal norm bounded. It uses Sinkhorn–Knopp for the projection. This eliminates signal explosion/vanishing and keeps training stable across depth. At large scale (27B etc.), it achieves stable training together with a performance improvement. With kernel fusion, recomputation, and overlapping of pipeline communication, the additional overhead is about 6.7%.
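So in mHC, projecting onto the Birkhoff polytope is equivalent to making H^{res} doubly stochastic: nonnegative entries, with every row and every column summing to 1. The underlying parameters are trained without constraint, and at each step the resulting matrix is projected onto the Birkhoff polytope using Sinkhorn-Knopp.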
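The projection step can be sketched with a plain Sinkhorn-Knopp iteration: exponentiate the unconstrained parameters so every entry is positive, then alternately normalize rows and columns. This is a minimal sketch rather than the paper's fused kernel; the function name, the example logits, and the default of 20 iterations are assumptions.

```python
import math


def sinkhorn_knopp(logits, iters=20):
    """Map an unconstrained square matrix to an approximately doubly stochastic one."""
    n = len(logits)
    # Exponentiate so that all entries are strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iters):
        # Row normalization: every row sums to 1.
        m = [[v / s for v in row] for row, s in ((r, sum(r)) for r in m)]
        # Column normalization: every column sums to 1.
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


logits = [[0.0, 0.5, -0.5], [0.3, 0.0, 0.2], [-0.2, 0.1, 0.4]]
ds = sinkhorn_knopp(logits)
row_sums = [sum(row) for row in ds]
col_sums = [sum(ds[i][j] for i in range(3)) for j in range(3)]
```

After the loop the columns sum to 1 up to rounding, and the rows sum to 1 up to the remaining convergence error, which shrinks quickly for well-conditioned inputs like this one.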
If we view the residual stream as "mixing," the most natural form of "safe mixing" is doubly stochastic (DS). The Birkhoff polytope is the convex hull of the permutation matrices, so a DS matrix acts as a "soft permutation / soft routing." Because the product of DS matrices is again DS, the constraint keeps its meaning even when accumulated (multiplied) across depth.
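The closure under multiplication, and the mean and norm behaviour, follow in a few lines. The derivation below is a sketch with notation introduced here: A and B are n×n doubly stochastic matrices, 𝟙 is the all-ones vector, and x is a vector with one entry per stream.

```latex
\begin{aligned}
&\text{Assume } A, B \ge 0,\quad A\mathbf{1} = B\mathbf{1} = \mathbf{1},\quad
  \mathbf{1}^\top A = \mathbf{1}^\top B = \mathbf{1}^\top. \\[4pt]
&\text{Closure:}\quad (AB)\mathbf{1} = A(B\mathbf{1}) = A\mathbf{1} = \mathbf{1},
  \qquad \mathbf{1}^\top (AB) = (\mathbf{1}^\top A)B = \mathbf{1}^\top B = \mathbf{1}^\top,
  \qquad AB \ge 0. \\[4pt]
&\text{Mean:}\quad \tfrac{1}{n}\,\mathbf{1}^\top (A x) = \tfrac{1}{n}\,\mathbf{1}^\top x. \\[4pt]
&\text{Norm:}\quad \|A\|_2 \le \sqrt{\|A\|_1\,\|A\|_\infty} = \sqrt{1 \cdot 1} = 1
  \;\Longrightarrow\; \|A x\|_2 \le \|x\|_2 .
\end{aligned}
```

So any product of such matrices across depth is still doubly stochastic, keeps the mean over streams unchanged, and cannot amplify the signal norm.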
 
 

mHC: Manifold-Constrained Hyper-Connections

 
 
