Skip connection, residual pathway
A method that adds a layer's input directly to its output, preventing information learned in lower layers from being lost as data flows through the network
Mathematically, residual connections keep the network near the Edge of Chaos: the Jacobian's spectral radius is pulled close to 1, which mitigates both vanishing and exploding gradient problems.
As a result, residual connections make the identity function easy to approximate: the branch only needs to output zero for the whole block to act as the identity.
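A minimal numpy sketch of this point (the layer shape and branch `F(x) = ReLU(x @ W)` are illustrative assumptions, not a specific published architecture): when the branch weights are zero, `y = x + F(x)` reduces exactly to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_layer(x, W):
    """One residual layer: y = x + F(x), with branch F(x) = ReLU(x @ W)."""
    return x + np.maximum(x @ W, 0.0)

x = rng.normal(size=(4, 8))

# Zero branch weights make F(x) = 0, so the layer is exactly the identity:
# the block's "default" behavior is to pass information through unchanged.
W_zero = np.zeros((8, 8))
print(np.allclose(residual_layer(x, W_zero), x))  # → True
```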
Residual connections require normalization to prevent output variance explosion, even when combined with modern weight initialization schemes (Xavier, He). Although Xavier initialization aims to keep activation variance constant, a residual block sums two paths, the identity and the residual branch, so the output variance accumulates with depth (roughly Var(x + F(x)) = Var(x) + Var(F(x)) when the two paths are uncorrelated), and exploding gradients can occur as the network gets deeper.
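A hedged numerical sketch of the variance blow-up (the width, depth, and the purely linear branch are arbitrary choices for illustration): each residual addition roughly doubles the variance, so without normalization it grows exponentially with depth even though every branch is Xavier-scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x = rng.normal(size=(1024, d))   # unit-variance input

variances = [x.var()]
for _ in range(depth):
    # Xavier-scaled linear branch: Var[W_ij] = 1/d keeps the branch's
    # output variance equal to its input's variance.
    W = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d))
    x = x + x @ W                # identity path + residual branch
    variances.append(x.var())

# Var(x + F(x)) ≈ Var(x) + Var(F(x)): since the branch sees the already
# grown activations, the variance roughly doubles per block.
print(variances[0], variances[-1])
```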
Despite this, transformer training remains unstable because Xavier initialization and the training algorithms (SGD, Adam) are not perfectly stable. Zhang et al., 2019 showed that ResNet-style models can be trained stably without normalization given proper initialization and settings. This opens the door to further research on stabilizing Transformer training through customized initialization and optimization strategies.

The input and output dimensions of a residual block must match so the addition is well-defined.
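When the dimensions differ, a common workaround (as in ResNet's 1x1-convolution shortcut) is to project the skip path with a learned linear map. A minimal sketch, with hypothetical shapes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W_branch, W_proj=None):
    """Residual block that projects the skip path when the branch
    changes the feature width (analogous to a 1x1-conv shortcut)."""
    shortcut = x if W_proj is None else x @ W_proj
    return shortcut + np.maximum(x @ W_branch, 0.0)

x = rng.normal(size=(4, 32))
W_branch = rng.normal(scale=np.sqrt(1 / 32), size=(32, 64))  # 32 -> 64
W_proj = rng.normal(scale=np.sqrt(1 / 32), size=(32, 64))    # match widths
y = residual_block(x, W_branch, W_proj)
print(y.shape)  # → (4, 64)
```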

Seonglae Cho