Knowledge Distillation

Creator: Seonglae Cho
Created: 2023 Jun 4 10:16
Edited: 2025 Apr 27 0:29

Distillation from a teacher network to a student network (with fewer parameters)

Pre-trained Teacher network → Student network
  • Create and transfer a dataset (reusable and identical to existing training methods)
  • Train using a KL divergence loss between the teacher's and student's logit distributions (see the loss sketch after the paragraph below)
Implementing KD begins with training the teacher model to its full capacity. Next, the student model is trained using a specific loss function.
This loss function is based not only on the hard labels of the training data but also on the soft outputs (probabilities) generated by the teacher model. These soft outputs convey the teacher's confidence across the classes, offering a more nuanced signal than hard labels alone. The process typically uses a temperature parameter to soften the probabilities, making the distribution more informative and easier for the student to learn from.
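A minimal PyTorch sketch of this combined objective follows. The function name, the default `temperature=4.0`, and the `alpha` weighting are illustrative assumptions, not values from the original note.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Hard-label cross entropy plus a temperature-softened KL term (sketch)."""
    # Soften both distributions with the temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened teacher and student distributions.
    # The T^2 factor keeps the soft-target gradients on a scale comparable
    # to the hard-label gradients as T changes.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)

    # Standard cross entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # alpha balances the two objectives.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

Here `student_logits` and `teacher_logits` are shaped `(batch, num_classes)` and `hard_labels` holds class indices; the teacher's logits would typically be computed under `torch.no_grad()` so that only the student receives gradients.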

Cross Entropy (G. Hinton et al.)

H(P_t, P_s) = -\sum_{y \in Y} P_t(y) \log P_s(y)
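As a quick numeric illustration of this soft-target cross entropy, a small sketch with toy logits (the values are arbitrary, chosen only for the example):

```python
import torch
import torch.nn.functional as F

# Toy logits over three classes, for illustration only.
teacher_logits = torch.tensor([2.0, 1.0, 0.1])
student_logits = torch.tensor([1.5, 0.8, 0.3])

p_t = F.softmax(teacher_logits, dim=-1)          # teacher distribution P_t(y)
log_p_s = F.log_softmax(student_logits, dim=-1)  # log of student distribution P_s(y)

# H(P_t, P_s) = -sum over y of P_t(y) * log P_s(y)
soft_cross_entropy = -(p_t * log_p_s).sum()
print(f"H(P_t, P_s) = {soft_cross_entropy.item():.4f}")
```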
Knowledge Distillation Notion

Knowledge Distillation Usages

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, "Distilling the Knowledge in a Neural Network", NIPS 2014