As high as possible, as low as necessary for convergence
A typical value is 1e-4; as the model gets smaller, you can use a higher learning rate such as 1e-3 or 1e-2
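The size-based rule of thumb above can be sketched as a tiny helper. The parameter-count thresholds here are illustrative assumptions, not values from the note:

```python
def suggested_lr(n_params: float) -> float:
    """Heuristic starting learning rate by model size.

    ~1e-4 is typical for large models; smaller models can
    tolerate higher rates (1e-3 to 1e-2). The cutoffs below
    are assumed for illustration only.
    """
    if n_params >= 1e9:   # billion-parameter scale
        return 1e-4
    if n_params >= 1e8:   # hundred-million scale
        return 1e-3
    return 1e-2           # small models


print(suggested_lr(7e9))  # large model → 1e-4
```

Treat the result as a starting point for a sweep, not a final setting, since the usable maximum still depends on batch size, schedule, and architecture.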
Model Regularization tradeoff


Learning Rate Usage
MiniCPM: Unveiling the Potential of End-side Large Language Models | Notion
Authors: Shengding Hu, Yuge Tu, Xu Han*, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Kaihuo Zhang, Yuxiang Huang, Zhenning Dai, Baitao Gong, Chongyi Wang, Yuan Yao, Jie Zhou, Jie Cai, Xinrong Zhang, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu*, Maosong Sun
https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
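The MiniCPM paper linked above proposes a Warmup-Stable-Decay (WSD) learning-rate schedule: warm up to a peak rate, hold it constant for most of training, then decay at the end. A minimal sketch, using a linear decay phase for simplicity (the exact decay shape and the phase fractions here are assumptions, not the paper's settings):

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.1, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay schedule sketch.

    Phase fractions (10% warmup, 10% decay) and the linear decay
    are illustrative choices, not MiniCPM's exact recipe.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # stable phase: hold the peak rate
        return peak_lr
    # decay phase: linear anneal from peak_lr to min_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac


print(wsd_lr(500, 1000, 1e-3))  # stable phase → 1e-3
```

The long stable phase is what lets a WSD run be resumed and decayed at different checkpoints to get multiple "finished" models from one training run.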


Seonglae Cho