Multi Token Prediction

Creator
Seonglae Cho
Created
2024 Jul 8 2:57
Edited
2025 May 5 23:57

MTP

Multiple head
https://arxiv.org/pdf/2412.19437v1
 

MTP Module

The residual-stream output of the main model is passed to a small single-layer transformer block before the output head, and one such block is added for each additional future token, starting from the token two positions ahead. The advantage is that these extra predictions are trained in parallel with regular next-token prediction, so backpropagation reflects not only the next token but also the relationships among multiple future tokens.
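To make the token alignment concrete, here is a minimal sketch in plain Python. It only shows which token each depth is trained to predict and how the losses might be combined; it does not implement the transformer block itself. The function names `mtp_alignment` and `total_loss` are hypothetical. The 0-based indexing and the weighting (main loss plus a factor `lam` times the mean of the extra modules' losses) follow my reading of the linked paper, so treat them as assumptions.

```python
def mtp_alignment(tokens, num_modules):
    """Return, per depth k, a list of (position i, extra input token, target token).

    Depth 0 is the ordinary next-token head: no extra input, target tokens[i + 1].
    Depth k >= 1 is the k-th extra module (assumed layout): it also receives
    tokens[i + k] and is trained to predict tokens[i + k + 1].
    """
    alignment = {}
    for k in range(num_modules + 1):
        rows = []
        # Positions near the end of the sequence have no target at this depth.
        for i in range(len(tokens) - k - 1):
            extra = tokens[i + k] if k >= 1 else None
            rows.append((i, extra, tokens[i + k + 1]))
        alignment[k] = rows
    return alignment


def total_loss(main_loss, module_losses, lam):
    """Main loss plus lam times the mean of the extra modules' losses."""
    if not module_losses:
        return main_loss
    return main_loss + lam * sum(module_losses) / len(module_losses)


tokens = ["the", "cat", "sat", "on", "mat"]
alignment = mtp_alignment(tokens, num_modules=1)
for depth, rows in alignment.items():
    print(depth, rows)
```

With one extra module, position 0 ("the") is trained to predict "cat" through the regular head and "sat" through the extra module, which is the sense in which the extra block starts from the token two positions ahead.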
