Activation Beacon

Created
2024 Apr 14 6:32
Creator
Seonglae Cho
Edited
2024 Apr 14 7:44
Refs

Condense Ratio → Condensed Activation

Condensing activations through the beacon token
The beacon token ⟨bcn⟩ is appended to the context, prompting the LLM to condense the raw activations into more compact forms.
The condensed activations are processed in a streaming fashion with a sliding window for auto-regression, as sketched below.
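A minimal sketch of this streaming loop, assuming hypothetical helpers `condense` and `stream`. The mean-pooling stand-in only marks where the beacon-conditioned LLM forward pass would go, and the window size and condensing ratio are arbitrary illustrative values, not the paper's settings.

```python
import torch

def condense(chunk_states: torch.Tensor, num_beacons: int) -> torch.Tensor:
    """Stand-in for the LLM pass in which appended beacon tokens attend to
    the chunk and return condensed activations. Here: mean-pooling over
    equal slices, purely illustrative."""
    slices = chunk_states.chunk(num_beacons, dim=0)
    return torch.stack([s.mean(dim=0) for s in slices])

def stream(context: torch.Tensor, window: int, ratio: int) -> torch.Tensor:
    """Process `context` (seq_len, d_model) window by window; each window of
    `window` tokens is condensed into `window // ratio` beacon activations,
    which form the compact rolling memory for subsequent steps."""
    memory = []
    for start in range(0, context.size(0), window):
        chunk = context[start:start + window]
        beacons = condense(chunk, max(1, chunk.size(0) // ratio))
        memory.append(beacons)  # raw activations of the chunk are discarded
    return torch.cat(memory)    # ~1/ratio of the original sequence length

ctx = torch.randn(1024, 64)             # 1024 tokens, hidden size 64
mem = stream(ctx, window=256, ratio=8)  # -> 128 condensed activations
print(mem.shape)                        # torch.Size([128, 64])
```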
We also maintain another copy of the LLM's MHA (multi-head self-attention) parameters, denoted MHAᵇ, including the layer-wise projection matrices for queries, keys, values, and outputs. These parameters are learned specifically for condensing the activations, and they account for about 1/3 of the LLM's original parameters (e.g., roughly 2B for the LLaMA-2 7B model). A sketch of this routing follows.
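The sketch below illustrates the idea of a second, trainable set of Q/K/V/O projections applied only at beacon positions while the original MHA weights stay frozen. Module and attribute names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BeaconAttentionProjections(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Original projections (frozen while the beacon copy is trained).
        self.q, self.k, self.v, self.o = (nn.Linear(d_model, d_model) for _ in range(4))
        # Trainable copy used only at beacon positions; since MHA is roughly
        # 1/3 of the model, this adds ~1/3 extra parameters, as noted above.
        self.q_b, self.k_b, self.v_b, self.o_b = (nn.Linear(d_model, d_model) for _ in range(4))
        for lin in (self.q, self.k, self.v, self.o):
            lin.weight.requires_grad_(False)
            lin.bias.requires_grad_(False)

    def project(self, hidden: torch.Tensor, is_beacon: torch.Tensor):
        """Route each position through the ordinary or beacon projections.
        hidden: (seq, d_model); is_beacon: (seq,) boolean mask.
        (Output projections o / o_b are kept for completeness; they would
        be applied after attention in the same masked fashion.)"""
        mask = is_beacon.unsqueeze(-1)
        q = torch.where(mask, self.q_b(hidden), self.q(hidden))
        k = torch.where(mask, self.k_b(hidden), self.k(hidden))
        v = torch.where(mask, self.v_b(hidden), self.v(hidden))
        return q, k, v

x = torch.randn(10, 64)
beacon_mask = torch.zeros(10, dtype=torch.bool)
beacon_mask[-2:] = True                 # last two positions are ⟨bcn⟩ tokens
proj = BeaconAttentionProjections(64)
q, k, v = proj.project(x, beacon_mask)
print(q.shape)                          # torch.Size([10, 64])
```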
At inference time, Activation Beacon is faster than LongLlama but slower than LongChat when the context is short, because Activation Beacon processes the context in a streaming manner while LongChat is fully parallel.
A caveat: additional plug-in training weights are introduced.
