Distillation from a teacher network to a student network (fewer parameters)
Pre-trained Teacher network → Student network
- Build a dataset from the teacher's outputs and hand it to the student (reusable, and compatible with the existing training procedure)
- Train with a KL-divergence loss on the difference between the teacher's and student's logit distributions (enables more precise learning than hard predictions alone)
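The KL-divergence term above can be sketched as follows. This is a minimal example, not the authors' exact code; the function name `kd_kl_loss` and the temperature default `T=4.0` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with temperature T before comparing them.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # reduction="batchmean" matches the mathematical KL definition;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```

When the student's logits match the teacher's exactly, this loss is zero; otherwise it penalizes the student in proportion to how far its softened distribution diverges from the teacher's.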
Implementing KD begins with training the teacher model to its full capacity. The student is then trained with a loss function based not only on the hard labels of the training data but also on the soft outputs (class probabilities) generated by the teacher. These soft outputs convey the teacher's relative confidence across classes, offering a more nuanced signal than hard labels alone. A temperature parameter is typically used to soften the probabilities, making the distribution more informative and easier for the student to learn from.
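The combined objective described above can be sketched as a weighted sum of the soft (teacher) and hard (label) terms. This is a hedged sketch: the function name `distillation_loss` and the defaults `T=4.0`, `alpha=0.5` are illustrative assumptions, tuned per task in practice.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable as T changes
    # Hard term: ordinary cross entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # alpha balances imitation of the teacher against fitting the labels.
    return alpha * soft + (1.0 - alpha) * hard
```

Setting `alpha=0` recovers plain supervised training; `alpha=1` trains purely on the teacher's softened distribution.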
Cross Entropy (G. Hinton et al.)
Knowledge Distillation Notion
Knowledge Distillation Usages
NIPS 2014 — Geoffrey Hinton, Oriol Vinyals, Jeff Dean