Multi Token Prediction

MTP

Multiple head

MTP Module

The residual-stream output of the main model is passed through a small single-layer transformer block before the output head. One such block is added for each additional future token, starting from the second token ahead. The advantage is that the extra tokens are predicted in parallel alongside regular next-token training, so the backpropagated gradient reflects the relationships among multiple future tokens rather than only the next one.
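The training objective described above can be illustrated with a minimal sketch. It assumes the per-head logits are already available and only shows how targets are shifted and how the losses are combined. The names `mtp_targets` and `mtp_loss` are hypothetical, the trunk and the per-token transformer blocks are not modeled, and the loss is averaged here (a sum would work the same way). It is not the implementation from the linked repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mtp_targets(tokens, n_future):
    """For each position t, return the n_future tokens that follow it.

    Positions too close to the end to have all n_future targets are dropped.
    """
    last = len(tokens) - n_future
    return [tokens[t + 1 : t + 1 + n_future] for t in range(last)]


def mtp_loss(head_logits, tokens):
    """Average cross-entropy over all heads and positions.

    head_logits[k][t] holds the (assumed) logits that head k produces at
    position t for the token at position t + 1 + k.
    """
    n_future = len(head_logits)
    targets = mtp_targets(tokens, n_future)
    total, count = 0.0, 0
    for t, future in enumerate(targets):
        for k, tok in enumerate(future):
            total += -log_softmax(head_logits[k][t])[tok]
            count += 1
    return total / count


if __name__ == "__main__":
    tokens = [0, 1, 2, 3]
    vocab_size, n_future = 4, 2
    # Uniform logits: every head is maximally uncertain at every position.
    uniform = [[[0.0] * vocab_size for _ in range(2)] for _ in range(n_future)]
    print(mtp_targets(tokens, n_future))
    print(mtp_loss(uniform, tokens))
```

With `n_future = 1` this reduces to ordinary next-token cross-entropy. Each additional head contributes one more shifted target per position, which is where the extra gradient signal comes from.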

arxiv.org

https://arxiv.org/pdf/2404.19737

facebook/multi-token-prediction · Hugging Face

https://huggingface.co/facebook/multi-token-prediction
