The context window itself remains unchanged: only the most recent tokens and the attention sinks are retained in the KV cache.
StreamingLLM is orthogonal to recent context-extension methods and can be integrated with them.
Efficient training and inference over continuous streaming data
Maintains high performance while greatly reducing memory usage
Handles inputs of up to 4 million tokens and beyond, stably and efficiently
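The cache-retention policy described above can be sketched as follows. This is a minimal illustration, not the actual StreamingLLM implementation; the function name `evict` and the parameters `n_sink` and `window` are hypothetical, chosen only to show the "keep the first few sink tokens plus a rolling window of recent tokens" idea.

```python
def evict(cache, n_sink=4, window=8):
    """Return the entries kept after StreamingLLM-style eviction:
    the first `n_sink` positions (attention sinks) plus the most
    recent `window` positions. Everything in between is dropped,
    so memory stays bounded no matter how long the stream grows."""
    if len(cache) <= n_sink + window:
        return list(cache)  # nothing to evict yet
    return list(cache[:n_sink]) + list(cache[-window:])

# Usage: positions 0..9999 flow through; only 4 + 8 entries survive.
kept = evict(list(range(10_000)), n_sink=4, window=8)
print(kept)
# [0, 1, 2, 3, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999]
```

Because only `n_sink + window` entries are ever kept, memory stays constant as the stream grows, which is what makes multi-million-token inputs tractable.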
SINK TOKEN
Even though keeping the first token might seem semantically meaningless, it matters. Because softmax normalization forces the attention scores at every position to sum to 1, the model must place that probability mass somewhere even when no token in the context is strongly relevant. The first token is visible to every subsequent position, so it ends up absorbing this excess attention and behaves as a "sink." Therefore, even if it is semantically meaningless, the model structurally requires it.
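The softmax property this explanation relies on can be checked directly. A minimal sketch (plain Python, no model involved): even when every raw attention score is zero, meaning no token is particularly relevant, the normalized weights still sum to 1, so that attention mass has to land on some token.

```python
import math

def softmax(scores):
    """Standard softmax: exponentiate, then normalize to sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# All scores are zero -- no token is "relevant" -- yet the weights
# are forced to distribute a full unit of attention anyway.
weights = softmax([0.0, 0.0, 0.0, 0.0])
print(weights)           # [0.25, 0.25, 0.25, 0.25]
print(sum(weights))      # 1.0
```

This is why a sink is needed: the always-visible first token gives the model a consistent place to dump attention that has nowhere meaningful to go.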