MTP
MTP Module
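In the multi-token prediction (MTP) module, the residual-stream output of the main model is passed through a small single-layer transformer before the output head. That layer predicts the token two positions ahead, and one more such layer is added for each further token to be predicted. The advantage is that these extra predictions are trained in parallel with regular next-token prediction, so backpropagation reflects the relationships among multiple upcoming tokens rather than only the next one.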
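The training signal described above can be illustrated with a minimal sketch. It shows only how the targets are shifted and how the two losses are combined, not the transformer layer itself. It assumes a single MTP layer that predicts the token two positions ahead. The names `shifted_targets`, `mean_cross_entropy`, `joint_loss`, and the weighting factor `mtp_weight` are hypothetical and chosen for illustration; the probability rows stand in for softmax outputs of the main head and the MTP head.

```python
import math

def shifted_targets(tokens, offset):
    """Pair each position i with the token `offset` steps ahead (i + offset)."""
    return [(i, tokens[i + offset]) for i in range(len(tokens) - offset)]

def mean_cross_entropy(prob_rows, targets):
    """Average negative log-likelihood of the target token at each position."""
    losses = [-math.log(prob_rows[i][tok]) for i, tok in targets]
    return sum(losses) / len(losses)

def joint_loss(main_probs, mtp_probs, tokens, mtp_weight=0.5):
    """Next-token loss plus a weighted loss for the token two steps ahead.

    mtp_weight is a hypothetical hyperparameter, not a value from the text.
    """
    main = mean_cross_entropy(main_probs, shifted_targets(tokens, 1))
    mtp = mean_cross_entropy(mtp_probs, shifted_targets(tokens, 2))
    return main + mtp_weight * mtp

# Toy example: vocabulary of 3 tokens, sequence of 4 tokens.
tokens = [0, 1, 2, 1]
uniform = [[1 / 3] * 3 for _ in tokens]  # stand-in for softmax outputs
loss = joint_loss(uniform, uniform, tokens)
```

Because both losses are summed before the backward pass, gradients from the two-ahead prediction flow into the shared main model alongside the ordinary next-token gradients, which is the effect the paragraph attributes to the module.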