Distillation from a teacher network to a student network (with fewer parameters)
Pre-trained Teacher network → Student network
- Create a transfer dataset (the existing training data can be reused as-is, so this step is identical to existing training methods)
- Train the student with a KL divergence loss between the teacher's and student's softened logit distributions (see the loss sketch below)
Implementing KD begins with training the teacher model to its full capacity. Next, the student model is trained using a specific loss function.
This loss function is based not only on the hard labels of the training data but also on the soft outputs (probabilities) generated by the teacher model. These soft outputs convey the teacher's confidence across the classes, offering a more nuanced signal than hard labels alone. The process typically uses a temperature parameter to soften the probabilities, making the distribution more informative and easier for the student to learn from.
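A minimal sketch of this combined loss in PyTorch (the framework choice, the function name `distillation_loss`, and the default values of `T` and `alpha` are assumptions, not from the original notes): both logit sets are softened by the temperature, their KL divergence is scaled by T², and the result is mixed with the usual hard-label cross entropy.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften both distributions with temperature T. F.kl_div expects
    # log-probabilities for the student and probabilities for the teacher.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)

    # Scale by T^2 so the soft-target gradients keep a comparable magnitude
    # to the hard-label term (as noted by Hinton et al.).
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard cross entropy against the hard ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # alpha balances imitation of the teacher against hard-label supervision.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

# Example usage (shapes and models are illustrative):
# student_logits = student(x)
# with torch.no_grad():
#     teacher_logits = teacher(x)          # teacher is frozen / pre-trained
# loss = distillation_loss(student_logits, teacher_logits, y, T=4.0, alpha=0.7)
```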
Cross Entropy (G. Hinton et al.)
Knowledge Distillation: Concept
Knowledge Distillation: Use Cases
NIPS 2014: Geoffrey Hinton, Oriol Vinyals, Jeff Dean