initialization
Start with a small learning rate, then raise it once training has stabilized.
warmup_steps: Number of steps used for a linear warmup from 0 to learning_rate.
warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to learning_rate.
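
A minimal sketch of how these two parameters could translate into a schedule. The helper names warmup_lr and steps_from_ratio are hypothetical, and rounding the ratio up to a whole number of steps is an assumption rather than something the definitions above specify.

```python
import math


def steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Convert warmup_ratio into a step count (rounding up is an assumption)."""
    return math.ceil(total_steps * warmup_ratio)


def warmup_lr(step: int, learning_rate: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to learning_rate over warmup_steps steps."""
    if warmup_steps <= 0:
        # No warmup requested: use the target learning rate from the start.
        return learning_rate
    # Scale grows linearly with step and is capped at 1.0 after warmup ends.
    return learning_rate * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    total_steps = 1000
    warmup_steps = steps_from_ratio(total_steps, warmup_ratio=0.25)
    for step in (0, warmup_steps // 2, warmup_steps, total_steps):
        print(step, warmup_lr(step, learning_rate=1e-3, warmup_steps=warmup_steps))
```

After the warmup phase this sketch simply holds the learning rate constant; in practice a decay schedule usually takes over from that point.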
Seonglae Cho