The context window itself remains unchanged: only the most recent tokens and the attention sinks are retained in the KV cache.
StreamingLLM is orthogonal to recent context-extension methods and can be integrated with them.
Efficient training and inference over continuous streaming data
Maintains high performance while greatly reducing memory usage
Handles inputs of up to 4 million tokens and beyond, stably and efficiently
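The cache-retention policy described above can be sketched as follows. This is a minimal illustration, not the actual StreamingLLM implementation; the function name `evict` and the parameters `n_sink` and `window` are hypothetical, chosen only to show the "keep the first few sink tokens plus a rolling window of recent tokens" idea.

```python
def evict(cache, n_sink=4, window=8):
    """Return the entries kept after StreamingLLM-style eviction:
    the first `n_sink` positions (attention sinks) plus the most
    recent `window` positions. Everything in between is dropped,
    so memory stays bounded no matter how long the stream grows."""
    if len(cache) <= n_sink + window:
        return list(cache)  # nothing to evict yet
    return list(cache[:n_sink]) + list(cache[-window:])

# Usage: positions 0..9999 flow through; only 4 + 8 entries survive.
kept = evict(list(range(10_000)), n_sink=4, window=8)
print(kept)
# [0, 1, 2, 3, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999]
```

Because only `n_sink + window` entries are ever kept, memory stays constant as the stream grows, which is what makes multi-million-token inputs tractable.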
SINK TOKEN
Even though keeping the first token might seem semantically meaningless, it matters. Because softmax normalization forces the attention scores at every position to sum to 1, the model must place that probability mass somewhere even when no token in the context is strongly relevant. The first token is visible to every subsequent position, so it ends up absorbing this excess attention and behaves as a "sink." Therefore, even if it is semantically meaningless, the model structurally requires it.
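The softmax property this explanation relies on can be checked directly. A minimal sketch (plain Python, no model involved): even when every raw attention score is zero, meaning no token is particularly relevant, the normalized weights still sum to 1, so that attention mass has to land on some token.

```python
import math

def softmax(scores):
    """Standard softmax: exponentiate, then normalize to sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# All scores are zero -- no token is "relevant" -- yet the weights
# are forced to distribute a full unit of attention anyway.
weights = softmax([0.0, 0.0, 0.0, 0.0])
print(weights)           # [0.25, 0.25, 0.25, 0.25]
print(sum(weights))      # 1.0
```

This is why a sink is needed: the always-visible first token gives the model a consistent place to dump attention that has nowhere meaningful to go.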