Knowledge Distillation

Creator: Seonglae Cho
Created: 2023 Jun 4 10:16
Edited: 2024 Jun 16 6:50

Distillation from a teacher network to a student network (with fewer parameters)

Pre-trained Teacher network → Student network
  • Build a dataset with the teacher and hand it to the student (reusable, and identical to the existing training procedure)
  • Logit distribution: train on the difference between the teacher's and the student's distributions using a KL Divergence loss (the code below makes this training more precise)
Implementing KD begins with training the teacher model to its full capacity. Next, the student model is trained using a specific loss function.
This loss function is based not only on the hard labels of the training data but also on the soft outputs (probabilities) generated by the teacher model. These soft outputs convey the teacher's confidence across the various classes, offering a more nuanced signal than hard labels alone. The process typically uses a temperature parameter to soften the probabilities, making the distribution more informative and easier for the student to learn from.
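As a compact formula, the combined objective is commonly written as below (a sketch following the Hinton et al. formulation; the symbols $z_t$, $z_s$, $T$, $\alpha$ are notation assumed here, not taken from the original note):

$$
\mathcal{L}_{\text{KD}} = \alpha \, T^{2}\, \mathrm{KL}\!\left(\operatorname{softmax}\!\left(\tfrac{z_t}{T}\right) \,\middle\|\, \operatorname{softmax}\!\left(\tfrac{z_s}{T}\right)\right) + (1-\alpha)\,\mathrm{CE}\!\left(y,\ \operatorname{softmax}(z_s)\right)
$$

where $z_t$ and $z_s$ are the teacher and student logits, $y$ is the hard label, $T$ is the temperature, and $\alpha$ weights the soft (distillation) term against the hard-label term.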

Cross Entropy
(G. Hinton et al.)
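A minimal PyTorch sketch of this combined loss, assuming the standard torch / torch.nn.functional API; the function name distillation_loss and the defaults T=2.0, alpha=0.5 are illustrative choices, not values prescribed by this note:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: both logit distributions are softened with temperature T.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)

    # KL divergence between the teacher and student soft distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Hard-label cross entropy on the student's unsoftened logits.
    ce = F.cross_entropy(student_logits, labels)

    # Weighted combination of the soft (KD) and hard (CE) terms.
    return alpha * kd + (1.0 - alpha) * ce

# Example usage with random tensors (shapes are illustrative).
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In practice the teacher's forward pass runs under torch.no_grad() (the teacher is frozen), so only the student receives gradients.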

Knowledge Distillation Notion

Knowledge Distillation Usages

Geoffrey Hinton, Oriol Vinyals, Jeff Dean. "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop.