Language Model Context

Creator
Seonglae Cho
Created
2023 Jul 9 12:44
Edited
2024 Sep 30 23:53

An LLM is quite dumb without context.

Extrapolation beyond the trained context length is an important issue, and it is a weak point of the Transformer Model.
The first approach was a hack using Positional Embedding. The trigonometric absolute positional encoding worked well by separating frequencies, but because it does not extrapolate, it fails on sequences longer than the lengths commonly seen during training.
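As a rough illustration of the trigonometric absolute encoding, here is a minimal sketch in plain Python. The function name `sinusoidal_encoding` and the parameters `d_model` and `base` (set to the commonly used value 10000) are assumptions made for this example and are not specified in this note.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the sinusoidal encoding vector for a single position.

    Each pair of dimensions (2i, 2i + 1) shares one frequency,
    base ** (-2i / d_model): the even slot holds sin, the odd slot cos.
    Low pair indices oscillate quickly, high pair indices slowly.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    vec = []
    for i in range(d_model // 2):
        freq = base ** (-2.0 * i / d_model)
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec


# Position 0 gives sin(0) = 0 and cos(0) = 1 in every pair.
print(sinusoidal_encoding(0, 4))
```

Each position gets its own combination of phases across the frequencies. A position far beyond the training range produces a combination the model has never been trained on, which is one way to read the extrapolation problem.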
Next came Relative Positional Encoding, which changes the attention score calculation according to the relative distance between tokens.
RoPE is the representative example. It encodes position by rotating vectors, so the relative angle between two tokens is determined by their distance, and the range of angles the model sees is bounded by the maximum context window size. This raised a question: could a model first be trained on short data, then have its context window enlarged and its rotation speed reduced proportionally, and be fine-tuned on long data, so that it handles long inputs while keeping what it learned on short ones? Models trained at lengths of 2k and 4k worked well, without a significant drop in perplexity, even when extended to 16k and 32k. Various position interpolation methods that exploit this property of RoPE have since been studied. There was also a bright prospect that, instead of fine-tuning the model, one could combine it with RAG and apply it to any desired service as long as there was enough data, relying on the transformer's in-context learning ability.
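The rotation idea and the "slow down the rotation" trick can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation of any particular model: the names `rope_rotate`, `attention_score`, and `interpolation_scale`, the pairing of adjacent dimensions, and the default `base` of 10000 are all assumptions made for this example.

```python
import math


def rope_rotate(vec, position, base=10000.0, scale=1.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to position.

    scale < 1 slows the rotation, which is the core of position interpolation.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = position * scale * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_score(q, k, q_pos, k_pos, scale=1.0):
    """Unnormalized attention score between a rotated query and a rotated key."""
    return dot(rope_rotate(q, q_pos, scale=scale), rope_rotate(k, k_pos, scale=scale))


def interpolation_scale(trained_len, target_len):
    """Shrink factor that maps positions in [0, target_len) back into [0, trained_len)."""
    return trained_len / target_len


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The score depends only on the distance between the two tokens (3 in both cases).
print(attention_score(q, k, 5, 2), attention_score(q, k, 105, 102))

# With a 2k -> 8k extension, position 8000 is rotated like position 2000 was in training.
s = interpolation_scale(2048, 8192)
print(rope_rotate(q, 8000, scale=s) == rope_rotate(q, 2000))
```

Under these assumptions, the longer window never produces an angle outside the range covered during training on short data. Neighboring tokens end up closer together in angle, which is why fine-tuning on long data is still part of the recipe.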
Lost in the middle poured cold water on RoPE's prospect. It showed that models whose context was expanded with RoPE interpolation referenced the beginning and end of the prompt well but did not capture the middle part well. It even reported that performance could be worse than when no data was given at all.
Gemini (Google) is seeking a solution with Ring Attention.
Starting with Reformer and Performer, many efficient transformer architectures were proposed to support long context while reducing the O(N^2) cost of attention. Some looked promising, but in the end ordinary vanilla attention turned out to work best as data increases. If there is one lesson the transformer has kept teaching us since it first appeared, it is to follow the rules without tricks.
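To make the O(N^2) cost concrete, here is a minimal sketch of vanilla scaled dot-product attention in plain Python. It is unmasked and single-head, and the names `softmax` and `vanilla_attention` are assumptions for this example. The point is only that the weight matrix has one entry per (query, key) pair.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_attention(queries, keys, values):
    """Return (outputs, weights) for unmasked scaled dot-product attention.

    weights is an N x N matrix for N tokens, which is where the
    quadratic cost in time and memory comes from.
    """
    d = len(keys[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = vanilla_attention(queries, keys, values)
print(len(weights), len(weights[0]))  # one row and one column per token
```

Doubling the number of tokens quadruples the number of entries in `weights`, which is the cost the efficient variants try to avoid.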
Language Model Context Notion
 
 
Were you fine-tuned on data, or did you end up like this from the prompt alone?
Language Model Context Usages
 
 
https://arxiv.org/pdf/2307.03172.pdf
LLaMa prefers the document given at the end.
 
Other models usually prefer the early part.
 
 
 

Long-term dependency
History

 
 
