Enables training with virtually simulate larger batch sizes by accumulating gradients over iterations, enhancing stability and model quality. Useful for training large-scale models in NLP or vision domains where memory constraints restrict batch sizes.
Model Generalization 측면에서 좋고 매번 업데이트 안해줘서 좋다
너무 크게 하면 local minimum에 빠질 수 있으니 조심해야
sequence length normalization matters