RWKV

Receptance Weighted Key Value

Parallelizable training

RWKV combines the simple inference process of an RNN with the efficient, parallel training of a Transformer.

Among models of similar size, it is reported to consume the least energy per token. It also works well on multilingual tasks.

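An RNN only needs to keep the hidden state of the most recent time step, so its inference memory efficiency is better than that of a Transformer, which has to store a KV cache for every token in the existing context. The weak point of RNNs is parallelism during training. RWKV and Reformer are architectures that, while being RNNs, can still compute over multiple tokens simultaneously during training, like a Transformer.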
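The trade-off above can be illustrated with a toy sketch. This is not the actual RWKV formulation: it is a simplified, scalar, exponentially decayed weighted average, written under the assumption that a single decay rate `decay` and per-token scalars `keys` and `values` are enough to show the idea. The function names `wkv_parallel` and `wkv_recurrent` are hypothetical, and the receptance gate, current-token bonus, and numerical stabilization of the real model are left out. The same outputs are computed two ways: from the full history (each position is independent of the others, so all of them could be computed at once during training) and from a small running state (RNN-style inference).

```python
import math


def wkv_parallel(keys, values, decay):
    """Training-style view: each output is computed directly from the full history.

    Every position t depends only on the inputs, not on other outputs,
    so all positions could be evaluated simultaneously.
    """
    outputs = []
    for t in range(len(keys)):
        # Weight of token i at position t: older tokens decay exponentially.
        weights = [math.exp(-(t - i) * decay + keys[i]) for i in range(t + 1)]
        numerator = sum(w * v for w, v in zip(weights, values))
        outputs.append(numerator / sum(weights))
    return outputs


def wkv_recurrent(keys, values, decay):
    """Inference-style view: only a fixed-size state (num, den) is carried forward."""
    num, den = 0.0, 0.0
    outputs = []
    for k, v in zip(keys, values):
        # Decay the old state one step, then add the current token's contribution.
        num = math.exp(-decay) * num + math.exp(k) * v
        den = math.exp(-decay) * den + math.exp(k)
        outputs.append(num / den)
    return outputs


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.0]
    values = [1.0, 2.0, -1.0, 0.5]
    parallel = wkv_parallel(keys, values, decay=0.5)
    recurrent = wkv_recurrent(keys, values, decay=0.5)
    # Both views should agree up to floating-point error.
    print(all(math.isclose(p, r) for p, r in zip(parallel, recurrent)))
```

In the recurrent form the state is two numbers no matter how long the sequence gets, whereas the full-history form needs every past key and value, which mirrors the KV cache growth of a Transformer.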

🦅 Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5)

A brand new era for the RWKV-v5 architecture and linear transformers has arrived - with the strongest multi-lingual model in open source today

https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers

RWKV: Reinventing RNNs for the Transformer Era

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast,...

https://arxiv.org/abs/2305.13048
