Multi Token Prediction

Created: 2024 Jul 8 2:57
Creator: Seonglae Cho
Edited: 2025 Feb 4 22:26

MTP

https://arxiv.org/pdf/2412.19437v1
 

MTP Module

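The main model's final residual-stream output (the hidden state just before the output head) is passed to a small single-layer Transformer block, and one such block is stacked for each extra future token predicted beyond the ordinary next token. The advantage is that the extra predictions are not just independent, parallel heads added on top of regular next-token training: each extra token is predicted sequentially from the previous depth's representation, so backpropagation also carries the relationships among the multiple predicted tokens.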
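To make the indexing concrete, here is a minimal pure-Python sketch of which tokens each prediction depth consumes and targets, and how the per-depth losses might be folded into the training loss. It follows the description in the linked report as read here: at depth k, position i combines the previous depth's hidden state with the embedding of token i+k and is trained to predict token i+k+1, and the MTP losses are averaged over depths and scaled by a weight. The function names `mtp_alignment` and `combine_losses`, the 0-based indexing, and the example weight are assumptions made for illustration, not part of any released code.

```python
def mtp_alignment(tokens, depth):
    """Map each MTP depth k (1..depth) to a list of
    (position i, extra input token t[i+k], target token t[i+k+1]).

    Positions are 0-based. The main model (depth 0) is not listed here;
    it predicts t[i+1] from position i as usual.
    """
    plan = {}
    n = len(tokens)
    for k in range(1, depth + 1):
        rows = []
        # Position i is usable only if both t[i+k] and t[i+k+1] exist.
        for i in range(n - k - 1):
            rows.append((i, tokens[i + k], tokens[i + k + 1]))
        plan[k] = rows
    return plan


def combine_losses(main_loss, mtp_losses, lam):
    """Total loss = main next-token loss + lam * mean of per-depth MTP losses.

    `lam` is a weighting hyperparameter; the value used below is illustrative.
    """
    if not mtp_losses:
        return main_loss
    return main_loss + lam * sum(mtp_losses) / len(mtp_losses)


if __name__ == "__main__":
    plan = mtp_alignment(["a", "b", "c", "d", "e"], depth=2)
    for k, rows in plan.items():
        print(f"depth {k}: {rows}")
    print(combine_losses(2.0, [1.0, 3.0], lam=0.5))
```

Each deeper module has one fewer usable position, because it looks one token further ahead. In a real model the tuples would index tensors (hidden states, embeddings, and cross-entropy targets) rather than strings.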
