Distillation from a teacher network to a student network (with fewer parameters)
Pre-trained Teacher network → Student network
- Create a transfer dataset (reusable, and compatible with existing training methods)
- Train using a KL divergence loss between the teacher and student logit distributions (sketched in code below)
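A minimal sketch of this KL divergence term, assuming PyTorch; the temperature `T` and the function name are illustrative choices, not part of the original note:

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         T: float = 4.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    # Soften both distributions with temperature T before comparing them
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # reduction="batchmean" matches the mathematical definition of KL divergence;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)
```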
Implementing KD begins with training the teacher model to its full capacity. Next, the student model is trained using a specific loss function.
This loss function is based not only on the hard labels of the training data but also on the soft outputs (probabilities) generated by the teacher model. These soft outputs convey the teacher's confidence across the various classes, offering a more nuanced signal than hard labels alone. The process typically uses a temperature parameter to soften the probabilities, making the distribution more informative and easier for the student to learn from.
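Putting these pieces together, a hedged sketch of one student training step might look like the following (PyTorch; `student`, `teacher`, `optimizer`, `alpha`, and `T` are placeholder names and hyperparameters, not values from the original note):

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, optimizer, images, labels,
               T: float = 4.0, alpha: float = 0.5):
    """One knowledge-distillation step: hard-label CE + temperature-softened KL term."""
    teacher.eval()
    with torch.no_grad():                      # the teacher only provides soft targets
        teacher_logits = teacher(images)

    student_logits = student(images)

    # Hard-label loss against the ground-truth classes
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-target loss: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # alpha balances imitation of the teacher against fitting the hard labels
    loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```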
Cross Entropy (G. Hinton et al.)
Knowledge Distillation Notion
Knowledge Distillation Usages
Geoffrey Hinton, Oriol Vinyals, Jeff Dean
Distilling the Knowledge in a Neural Network
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately,...
https://arxiv.org/abs/1503.02531

Knowledge Distillation: a technique for distilling knowledge from deep learning models
https://baeseongsu.github.io/posts/knowledge-distillation/
Deep learning terminology: explaining and understanding knowledge distillation
What is knowledge distillation? Knowledge distillation is a concept introduced in "Distilling the Knowledge in a Neural Network", a paper submitted to NIPS 2014 under the names of Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. The goal of knowledge distillation is to transfer the knowledge of a large, well-trained network (the teacher network) to a smaller network (the student network) that will actually be deployed. To elaborate on this goal ..
https://light-tree.tistory.com/196

Seonglae Cho