Adam (RMSprop + Momentum)

Because Adam keeps two additional vectors per parameter (the first momentum and the second momentum), its memory usage increases significantly.

Reference: https://moon-walker.medium.com/large-model-학습의-game-changer-ms의-deepspeed-zero-1-2-3-그리고-zero-infinity-74c9640190de

Update rule:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

where $\eta$ is the learning rate and $\epsilon$ prevents division by zero.

Paper: arxiv.org https://arxiv.org/pdf/1412.6980.pdf
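As a concrete illustration, here is a minimal NumPy sketch of one Adam update step (not from the original note; the function name `adam_step`, the default values for `eta`, `beta1`, `beta2`, and `eps`, and all variable names are assumptions chosen to mirror the symbols in the formula above). The arrays `m` and `v` are the two extra per-parameter vectors responsible for the memory overhead mentioned above.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (hypothetical sketch). t is the 1-based step count."""
    # First momentum: exponential moving average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # Second momentum: exponential moving average of squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (the "hats" in the update rule)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # theta_{t+1} = theta_t - eta / (sqrt(v_hat) + eps) * m_hat
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that `m` and `v` have the same shape as `theta`, which is why optimizer state roughly triples the per-parameter memory compared to plain SGD.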