Sophia: A Scalable Stochastic Second-order Optimizer for Language...
Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its...
https://arxiv.org/abs/2305.14342