Residual Connection

Creator
Seonglae Cho
Created
2023 Mar 7 13:2
Edited
2024 Nov 25 15:30

Skip connection, residual pathway

Fundamentally, a method for preventing information learned in lower layers from being lost as data passes through the network.
As a result, residual connections make the identity function easier to approximate: a block only has to drive its residual branch toward zero.
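A minimal NumPy sketch of this point, assuming a two-layer ReLU branch as the residual function. The names `residual_block`, `w1`, and `w2_zero` are illustrative and not from the note. When the last weight matrix of the branch is zero, the block returns its input unchanged.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute y = x + F(x), where F is a small two-layer branch with ReLU."""
    hidden = np.maximum(0.0, x @ w1)   # residual branch: linear layer + ReLU
    return x + hidden @ w2             # skip connection adds the input back

rng = np.random.default_rng(0)
dim, hidden_dim = 8, 16                 # illustrative sizes
x = rng.normal(size=(4, dim))

w1 = rng.normal(size=(dim, hidden_dim))
w2_zero = np.zeros((hidden_dim, dim))   # branch output is exactly zero

y = residual_block(x, w1, w2_zero)
is_identity = np.allclose(y, x)         # the block acts as the identity
print(is_identity)
```

Without the skip connection, the same block would have to learn weights that reproduce `x` through the ReLU, which is a harder target than an all-zero `w2`.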
Residual connections require normalization to prevent the output variance from exploding when used with SOTA Weight Initialization (Xavier, He). Xavier Initialization aims to keep the variance constant across a layer, but a residual block sums two channels, the identity path and the residual branch, as in $y = x + F(x)$. The variance therefore keeps increasing from block to block, and exploding gradient problems can occur as the network gets deeper.
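The variance growth can be checked numerically. The sketch below assumes a purely linear residual branch with square weight matrices, so Xavier initialization gives Var(w) = 1/dim. The function names and sizes are illustrative choices. Under these assumptions each block roughly doubles the activation variance, so ten blocks should give growth on the order of 2**10.

```python
import numpy as np

def xavier_linear(rng, dim):
    """Square weight matrix with Var(w) = 1/dim (Xavier when fan_in == fan_out)."""
    return rng.normal(scale=np.sqrt(1.0 / dim), size=(dim, dim))

def variance_per_block(depth, dim=256, batch=2048, seed=0):
    """Track activation variance through `depth` blocks of y = x + x @ W."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, dim))           # unit-variance input
    variances = [float(x.var())]
    for _ in range(depth):
        x = x + x @ xavier_linear(rng, dim)     # identity path + linear branch
        variances.append(float(x.var()))
    return variances

variances = variance_per_block(depth=10)
growth = variances[-1] / variances[0]           # total variance growth
print([round(v, 1) for v in variances])
print(growth)
```

The branch alone preserves variance, which is what Xavier initialization is designed for. The growth comes entirely from adding the identity path on top.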
Even with normalization, Transformer training remains unstable because Xavier Initialization and the training algorithms (SGD with the Adam Optimizer) are not perfectly stable. Zhang et al., 2019 suggested that ResNet-style models can be trained stably without normalization, given proper initialization and settings. This opens the door to further research into stabilizing Transformer training through customized initialization and optimization strategies.
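As a rough illustration of how initialization alone can control variance, the sketch below rescales the residual branch at initialization. This is a simplified stand-in, not the procedure from Zhang et al., 2019. The `branch_scale` parameter, the linear branch, and the `depth ** -0.5` choice are assumptions made for this example.

```python
import numpy as np

def final_variance_ratio(depth, branch_scale, dim=256, batch=2048, seed=0):
    """Var(output) / Var(input) after `depth` blocks of y = x + x @ W,
    where W is Xavier-initialized and then multiplied by `branch_scale`."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, dim))
    input_var = x.var()
    for _ in range(depth):
        w = rng.normal(scale=np.sqrt(1.0 / dim), size=(dim, dim))
        x = x + x @ (branch_scale * w)          # down-weighted residual branch
    return float(x.var() / input_var)

depth = 10
plain_ratio = final_variance_ratio(depth, branch_scale=1.0)           # unscaled branch
scaled_ratio = final_variance_ratio(depth, branch_scale=depth ** -0.5)
zero_ratio = final_variance_ratio(depth, branch_scale=0.0)            # identity blocks
print(plain_ratio, scaled_ratio, zero_ratio)
```

With the branch scaled by `depth ** -0.5`, each block multiplies the variance by about `1 + 1/depth`, so the total stays bounded as depth grows instead of doubling per block.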
https://wikidocs.net/31379
The dimension sizes used in the layers must match, so that the input can be added to the layer's output.
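A short sketch of the dimension constraint. The helper `add_skip` and the linear projection `w_proj` are hypothetical. Projecting the shortcut is one common workaround when the sizes differ, and the note itself does not prescribe it.

```python
import numpy as np

def add_skip(x, branch_out, projection=None):
    """Return shortcut + branch_out, projecting x first if a matrix is given."""
    shortcut = x if projection is None else x @ projection
    if shortcut.shape != branch_out.shape:
        # Explicit check: NumPy broadcasting could otherwise hide some mismatches.
        raise ValueError(f"shape mismatch: {shortcut.shape} vs {branch_out.shape}")
    return shortcut + branch_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
same_dim_out = rng.normal(size=(4, 8))     # branch output with matching dimension
wider_out = rng.normal(size=(4, 16))       # branch output with a different dimension

y_same = add_skip(x, same_dim_out)         # shapes match, plain addition

w_proj = rng.normal(size=(8, 16)) / np.sqrt(8)       # hypothetical projection matrix
y_proj = add_skip(x, wider_out, projection=w_proj)   # shortcut mapped to 16 dims

try:
    add_skip(x, wider_out)                 # no projection, so the shapes differ
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(y_same.shape, y_proj.shape, mismatch_raised)
```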
 
 
 
 
 
 
