Skip connection, residual pathway
A method that adds a layer's input directly to its output, preventing information learned in lower layers from being lost as data flows through the network
Mathematically, residual connections keep the network near the Edge of Chaos: the Jacobian's spectral radius is pulled close to 1, which mitigates both vanishing and exploding gradient problems.
As a result, residual connections make the identity function easy to approximate: the branch only needs to output zero for the whole block to act as the identity.
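A minimal numpy sketch of this point (the layer shape and branch `F(x) = ReLU(x @ W)` are illustrative assumptions, not a specific published architecture): when the branch weights are zero, `y = x + F(x)` reduces exactly to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_layer(x, W):
    """One residual layer: y = x + F(x), with branch F(x) = ReLU(x @ W)."""
    return x + np.maximum(x @ W, 0.0)

x = rng.normal(size=(4, 8))

# Zero branch weights make F(x) = 0, so the layer is exactly the identity:
# the block's "default" behavior is to pass information through unchanged.
W_zero = np.zeros((8, 8))
print(np.allclose(residual_layer(x, W_zero), x))  # → True
```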
Residual connections require normalization to prevent output variance explosion, even when combined with modern weight initialization schemes (Xavier, He). Although Xavier initialization aims to keep activation variance constant, a residual block sums two paths, the identity and the residual branch, so the output variance accumulates with depth (roughly Var(x + F(x)) = Var(x) + Var(F(x)) when the two paths are uncorrelated), and exploding gradients can occur as the network gets deeper.
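A hedged numerical sketch of the variance blow-up (the width, depth, and the purely linear branch are arbitrary choices for illustration): each residual addition roughly doubles the variance, so without normalization it grows exponentially with depth even though every branch is Xavier-scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x = rng.normal(size=(1024, d))   # unit-variance input

variances = [x.var()]
for _ in range(depth):
    # Xavier-scaled linear branch: Var[W_ij] = 1/d keeps the branch's
    # output variance equal to its input's variance.
    W = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d))
    x = x + x @ W                # identity path + residual branch
    variances.append(x.var())

# Var(x + F(x)) ≈ Var(x) + Var(F(x)): since the branch sees the already
# grown activations, the variance roughly doubles per block.
print(variances[0], variances[-1])
```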
Despite this, transformer training remains unstable because Xavier initialization and the training algorithms (SGD, Adam) are not perfectly stable. Zhang et al., 2019 showed that ResNet-style models can be trained stably without normalization given proper initialization and settings. This opens the door to further research on stabilizing Transformer training through customized initialization and optimization strategies.

The input and output dimensions of a residual block must match so the addition is well-defined.
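When the dimensions differ, a common workaround (as in ResNet's 1x1-convolution shortcut) is to project the skip path with a learned linear map. A minimal sketch, with hypothetical shapes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W_branch, W_proj=None):
    """Residual block that projects the skip path when the branch
    changes the feature width (analogous to a 1x1-conv shortcut)."""
    shortcut = x if W_proj is None else x @ W_proj
    return shortcut + np.maximum(x @ W_branch, 0.0)

x = rng.normal(size=(4, 32))
W_branch = rng.normal(scale=np.sqrt(1 / 32), size=(32, 64))  # 32 -> 64
W_proj = rng.normal(scale=np.sqrt(1 / 32), size=(32, 64))    # match widths
y = residual_block(x, W_branch, W_proj)
print(y.shape)  # → (4, 64)
```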

Seonglae Cho