MTP (Multi-Token Prediction)
Multiple head

MTP Module
The residual-stream output of the main model is passed through a small single-layer transformer before the output head. One such layer is added for each additional future token, starting from the second token ahead. The advantage is that the extra tokens are predicted in parallel with regular next-token training, so backpropagation reflects the relationships among multiple tokens rather than only the next one.
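To make the training signal concrete, here is a minimal sketch of how the targets line up when several heads are trained together. It covers only the loss alignment, not the single-layer transformer itself. The names `mtp_targets` and `mtp_loss`, the dictionary layout of `head_logits`, and the equal weighting of the heads are all assumptions made for illustration, not details taken from the linked model.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_targets(tokens, n_future):
    """For each head k (1-based), pair position t with the token at t + k."""
    return {
        k: [(t, tokens[t + k]) for t in range(len(tokens) - k)]
        for k in range(1, n_future + 1)
    }


def mtp_loss(head_logits, tokens):
    """Sum over heads of each head's mean cross-entropy.

    head_logits[k][t] is the (hypothetical) logit vector that head k
    produces at position t; head 1 is the ordinary next-token head.
    Equal weighting of heads is an assumption of this sketch.
    Requires len(tokens) > number of heads.
    """
    total = 0.0
    for k, pairs in mtp_targets(tokens, len(head_logits)).items():
        losses = [cross_entropy(head_logits[k][t], tgt) for t, tgt in pairs]
        total += sum(losses) / len(losses)
    return total


# Usage: two heads over a toy sequence with a vocabulary of 3 tokens.
tokens = [0, 1, 2, 1]
uniform = {k: [[0.0] * 3 for _ in tokens] for k in (1, 2)}
loss = mtp_loss(uniform, tokens)
```

Because every head contributes to one summed loss, the gradient reaching the shared layers carries information about tokens further ahead than the next one, which is the benefit described above.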
facebook/multi-token-prediction · Hugging Face
https://huggingface.co/facebook/multi-token-prediction

Seonglae Cho