HC expands the residual stream by a factor of n in the width direction, so that instead of a single residual line between layers there are n parallel residual streams. At each layer it introduces three small learnable mixing operations (sketched in code below):

- H^{pre}: aggregates the n streams into a single input of the original width C for the block F(·)
- H^{post}: scatters the block output of width C back across the n streams
- H^{res}: mixes the n streams with each other along the residual path using an n×n matrix

This lowers pretraining loss and, as a result, improves downstream benchmark performance (such as MMLU, GSM8K).
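To make the three operations concrete, here is a minimal PyTorch sketch of a single static HC layer. The class name HyperConnection, the (n_streams, batch, seq, C) shape convention, and the initialization are my own illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """One HC layer: n parallel residual streams mixed around a block F(·).

    Sketch only, based on my reading of the description above.
    """
    def __init__(self, n_streams: int, block: nn.Module):
        super().__init__()
        self.block = block                                                    # F(·), e.g. attention or MLP
        self.h_pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # H^{pre}: aggregate n streams -> 1 input
        self.h_post = nn.Parameter(torch.ones(n_streams))                     # H^{post}: scatter 1 output -> n streams
        self.h_res = nn.Parameter(torch.eye(n_streams))                       # H^{res}: n×n stream-to-stream mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_streams, batch, seq, C)
        block_in = torch.einsum("n,nbsc->bsc", self.h_pre, x)            # aggregate to the original width C
        block_out = self.block(block_in)                                  # run the block at width C
        scattered = torch.einsum("n,bsc->nbsc", self.h_post, block_out)  # scatter the output back to n streams
        mixed = torch.einsum("mn,nbsc->mbsc", self.h_res, x)             # mix the residual streams with each other
        return mixed + scattered
```

At the network boundary, the single residual stream is typically replicated into n copies at the input and reduced (e.g. summed) back to one at the output; the HC paper also describes a dynamic variant where the mixing weights are computed from the input rather than being static parameters.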
Limitation
HC breaks the identity mapping of the residual path. The accumulated product ∏ H^{res} across layers can easily cause signal explosion/vanishing → training instability at large scale. The paper shows that at 27B, loss/grad-norm spikes occur and "the gain of the composite mapping spikes up to around 3000."
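A toy numpy check (not from the paper) of why the unconstrained product is dangerous: multiply many near-identity mixing matrices and watch the gain of the composite mapping drift away from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 60                                             # 4 residual streams, 60 layers (toy numbers)

gain = np.eye(n)
for _ in range(depth):
    h_res = np.eye(n) + 0.2 * rng.standard_normal((n, n))    # unconstrained H^{res}, slightly off identity
    gain = h_res @ gain                                      # composite mapping ∏ H^{res}

# Largest singular value of the composite mapping: for generic unconstrained
# matrices it grows (or shrinks) exponentially with depth instead of staying near 1.
print(np.linalg.norm(gain, 2))
```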
mHC
mHC cuts the transformer's composite-mapping gain explosion from ~3000 → ~1.6

mHC applies a manifold constraint to the residual matrices H^{res}, making them doubly stochastic (points in the Birkhoff polytope), which preserves the signal mean and keeps the norm bounded. Sinkhorn–Knopp is used for the projection. This eliminates signal explosion/vanishing and ensures stability across depth. At large scale (27B etc.), it achieves stable training + a performance improvement. Through kernel fusion, recomputation, and pipeline communication overlap, the additional overhead is kept to ~6.7%.
So in mHC, projecting onto the Birkhoff polytope is the same thing as making the matrix doubly stochastic. Training proceeds freely → at each step, the resulting residual matrix is projected onto the Birkhoff polytope using Sinkhorn–Knopp.
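A minimal Sinkhorn–Knopp sketch, assuming the standard recipe (exponentiate to get positive entries, then alternately normalize rows and columns); the function name and iteration count are arbitrary choices, not the paper's.

```python
import numpy as np

def sinkhorn_knopp(m: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Map an unconstrained square matrix to a (near) doubly stochastic one."""
    a = np.exp(m)                                  # strictly positive entries
    for _ in range(n_iters):
        a = a / a.sum(axis=1, keepdims=True)       # normalize rows to sum to 1
        a = a / a.sum(axis=0, keepdims=True)       # normalize columns to sum to 1
    return a                                       # converges to a point in the Birkhoff polytope

h_res = np.random.default_rng(0).standard_normal((4, 4))
ds = sinkhorn_knopp(h_res)
print(ds.sum(axis=0), ds.sum(axis=1))              # both close to [1, 1, 1, 1]
```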
If we view the residual stream as "mixing," the most natural form of "safe mixing" is a doubly stochastic (DS) matrix. The Birkhoff polytope is the convex hull of permutation matrices → "soft permutation / soft routing." This is the only type of constraint that preserves meaning even when accumulated (multiplied) across depth, since the product of doubly stochastic matrices is itself doubly stochastic.
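A quick sanity check of the "safe when accumulated" claim (toy code, not from the paper): sample points in the Birkhoff polytope as convex combinations of permutation matrices, multiply many of them, and confirm the composite stays doubly stochastic with gain exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 60

def random_doubly_stochastic(n: int) -> np.ndarray:
    """Random point in the Birkhoff polytope: convex combination of permutation matrices."""
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(8)]
    weights = rng.dirichlet(np.ones(len(perms)))
    return sum(w * p for w, p in zip(weights, perms))

gain = np.eye(n)
for _ in range(depth):
    gain = random_doubly_stochastic(n) @ gain      # composite ∏ H^{res} under the DS constraint

print(gain.sum(axis=0), gain.sum(axis=1))          # rows and columns still sum to 1
print(np.linalg.norm(gain, 2))                     # spectral norm stays 1: no explosion across depth
```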

Seonglae Cho