It looks at every example in the entire training set on every step
The difference from Gradient Accumulation is that it applies the loss function to the entire dataset at once
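A minimal NumPy sketch of the distinction (the toy linear-regression setup and all names here are my own illustration, not from the notes): full-batch gradient descent computes one gradient over the whole dataset, while gradient accumulation rebuilds the same update from minibatch pieces.

```python
import numpy as np

# Hypothetical toy problem: linear regression with MSE loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # the entire training set
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean-squared-error loss over the batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

lr = 0.1

# (Full-)batch gradient descent: every step uses every example at once.
w = np.zeros(5)
for step in range(100):
    w -= lr * grad(w, X, y)

# Gradient accumulation: accumulate minibatch gradients, then take a
# single step. The update is numerically the same; only the way the
# full-dataset gradient is assembled differs.
w2 = np.zeros(5)
for step in range(100):
    g = np.zeros(5)
    for Xb, yb in zip(np.split(X, 10), np.split(y, 10)):
        g += grad(w2, Xb, yb) / 10        # average of 10 minibatch grads
    w2 -= lr * g

assert np.allclose(w, w2)  # same trajectory, built two different ways
```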
Learning rate and mini-batch size affect the smoothness of the loss landscape, which can be analyzed via the Hessian. The largest eigenvalue of the Hessian matrix describes the direction in which the gradient changes fastest (the sharpest curvature) and thus defines a speed limit on updates: for a quadratic loss, gradient descent stays stable only when lr < 2/λ_max
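A sketch of that speed limit on a toy quadratic (the matrix H and all names are my own assumptions for illustration): power iteration with Hessian-vector products estimates λ_max, and stepping just below vs. just above 2/λ_max shows stable vs. divergent updates.

```python
import numpy as np

# Hypothetical quadratic loss f(w) = 0.5 * w^T H w, so the Hessian is H
# and the gradient is H @ w.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20))
H = A @ A.T / 20                     # symmetric PSD Hessian

def hvp(v):
    # Hessian-vector product; in deep learning autodiff gives this
    # without ever forming H explicitly.
    return H @ v

# Power iteration: repeatedly apply the Hessian to find the largest
# eigenvalue, i.e. the direction of fastest-changing gradient.
v = rng.normal(size=20)
for _ in range(200):
    v = hvp(v)
    v /= np.linalg.norm(v)
lam_max = v @ hvp(v)

# Gradient descent on this quadratic multiplies each eigen-direction by
# (1 - lr * lam) per step, so it is stable only if lr < 2 / lam_max.
for lr in (1.9 / lam_max, 2.1 / lam_max):
    w = rng.normal(size=20)
    for _ in range(200):
        w -= lr * hvp(w)             # gradient of the quadratic is H @ w
    print(f"lr = {lr:.3f}: |w| = {np.linalg.norm(w):.3e}")
```

Running this prints a small norm for lr just under the limit and a norm that has blown up by many orders of magnitude just above it, which is the "speed limit" the largest eigenvalue imposes.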