Activation Beacon

Created
2024 Apr 14 6:32
Creator
Seonglae Cho
Edited
2024 Apr 14 7:44
Refs

Condense Ratio → Condensed Activation

Condensing activations through the beacon token
The beacon token ⟨bcn⟩ is appended to the context, prompting the LLM to condense the raw activations into more compact forms.
The condensed activations are processed in a streaming fashion with a sliding window for auto-regression, as sketched below.
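A minimal sketch of this streaming loop, assuming hypothetical helpers `condense` and `stream`. The mean-pooling stand-in only marks where the beacon-conditioned LLM forward pass would go, and the window size and condensing ratio are arbitrary illustrative values, not the paper's settings.

```python
import torch

def condense(chunk_states: torch.Tensor, num_beacons: int) -> torch.Tensor:
    """Stand-in for the LLM pass in which appended beacon tokens attend to
    the chunk and return condensed activations. Here: mean-pooling over
    equal slices, purely illustrative."""
    slices = chunk_states.chunk(num_beacons, dim=0)
    return torch.stack([s.mean(dim=0) for s in slices])

def stream(context: torch.Tensor, window: int, ratio: int) -> torch.Tensor:
    """Process `context` (seq_len, d_model) window by window; each window of
    `window` tokens is condensed into `window // ratio` beacon activations,
    which form the compact rolling memory for subsequent steps."""
    memory = []
    for start in range(0, context.size(0), window):
        chunk = context[start:start + window]
        beacons = condense(chunk, max(1, chunk.size(0) // ratio))
        memory.append(beacons)  # raw activations of the chunk are discarded
    return torch.cat(memory)    # ~1/ratio of the original sequence length

ctx = torch.randn(1024, 64)             # 1024 tokens, hidden size 64
mem = stream(ctx, window=256, ratio=8)  # -> 128 condensed activations
print(mem.shape)                        # torch.Size([128, 64])
```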
We also maintain another copy of the LLM's MHA (multi-head self-attention) parameters, denoted MHAᵇ, including the layer-wise projection matrices for queries, keys, values, and outputs. These parameters are learned specifically for condensing the activations, and they account for about 1/3 of the LLM's original parameters (e.g., roughly 2B for the LLaMA-2 7B model). A sketch of this routing follows.
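The sketch below illustrates the idea of a second, trainable set of Q/K/V/O projections applied only at beacon positions while the original MHA weights stay frozen. Module and attribute names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BeaconAttentionProjections(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Original projections (frozen while the beacon copy is trained).
        self.q, self.k, self.v, self.o = (nn.Linear(d_model, d_model) for _ in range(4))
        # Trainable copy used only at beacon positions; since MHA is roughly
        # 1/3 of the model, this adds ~1/3 extra parameters, as noted above.
        self.q_b, self.k_b, self.v_b, self.o_b = (nn.Linear(d_model, d_model) for _ in range(4))
        for lin in (self.q, self.k, self.v, self.o):
            lin.weight.requires_grad_(False)
            lin.bias.requires_grad_(False)

    def project(self, hidden: torch.Tensor, is_beacon: torch.Tensor):
        """Route each position through the ordinary or beacon projections.
        hidden: (seq, d_model); is_beacon: (seq,) boolean mask.
        (Output projections o / o_b are kept for completeness; they would
        be applied after attention in the same masked fashion.)"""
        mask = is_beacon.unsqueeze(-1)
        q = torch.where(mask, self.q_b(hidden), self.q(hidden))
        k = torch.where(mask, self.k_b(hidden), self.k(hidden))
        v = torch.where(mask, self.v_b(hidden), self.v(hidden))
        return q, k, v

x = torch.randn(10, 64)
beacon_mask = torch.zeros(10, dtype=torch.bool)
beacon_mask[-2:] = True                 # last two positions are ⟨bcn⟩ tokens
proj = BeaconAttentionProjections(64)
q, k, v = proj.project(x, beacon_mask)
print(q.shape)                          # torch.Size([10, 64])
```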
At inference time, Activation Beacon is faster than LongLlama but slower than LongChat when the context is short, because Activation Beacon processes the context in a streaming manner while LongChat is fully parallel.
A caveat: additional plug-in training weights are introduced.
