NTK-aware interpolation
To address the loss of high-frequency information that occurs when interpolating RoPE embeddings, NTK-aware interpolation was introduced. Instead of scaling every RoPE dimension equally by the scale factor, it spreads the interpolation pressure across the dimensions, scaling high frequencies less and low frequencies more.
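In practice this is usually done by enlarging the rotary base rather than dividing positions directly. A minimal sketch under that assumption (the function name is illustrative; `dim` is the per-head dimension and `scale` the desired extension factor):

```python
import torch

def ntk_aware_inv_freq(dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware trick: enlarge the rotary base instead of dividing positions
    # by `scale`. The highest-frequency dimension is left almost untouched,
    # while the lowest-frequency dimension ends up interpolated by roughly
    # the full factor `scale`.
    adjusted_base = base * scale ** (dim / (dim - 2))
    # Standard RoPE inverse frequencies, computed from the adjusted base.
    return 1.0 / adjusted_base ** (torch.arange(0, dim, 2).float() / dim)
```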
However, this method has one major disadvantage: since it is not purely an interpolation scheme, some dimensions are slightly extrapolated to out-of-bound values, so fine-tuning with NTK-aware interpolation yields inferior results. In practice, the scale factor therefore has to be set higher than the expected scale for a given context-length extension.
NTK-by-parts interpolation
Stretching all RoPE dimensions uniformly pushes tokens closer together, impairing the model's ability to understand small, local relationships between internal embeddings. NTK-by-parts interpolation addresses this by not interpolating the higher-frequency dimensions at all, always interpolating the lower-frequency dimensions, and blending the two for the dimensions in between.
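A sketch of this per-dimension blend, using a linear ramp on the number of rotations each dimension completes over the original context window; the function name and the cutoffs `alpha` and `beta` are assumptions here (α = 1, β = 32 are the values the YaRN paper suggests for LLaMA-family models):

```python
import math

import torch

def ntk_by_parts_inv_freq(
    dim: int,
    scale: float,
    orig_ctx_len: int,
    base: float = 10000.0,
    alpha: float = 1.0,
    beta: float = 32.0,
) -> torch.Tensor:
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    # Wavelength of each RoPE dimension, and how many full rotations it
    # completes over the original context window.
    wavelength = 2 * math.pi / inv_freq
    rotations = orig_ctx_len / wavelength
    # Ramp per dimension: 0 -> interpolate fully (low frequency, few
    # rotations), 1 -> leave untouched (high frequency), linear in between.
    gamma = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return (1 - gamma) * inv_freq / scale + gamma * inv_freq
```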
Dynamic NTK interpolation
With the above approaches, the model may experience a performance drop at sequence lengths shorter than L (the trained context length) and abrupt degradation once the sequence grows longer than L. Dynamic Scaling was introduced so that the model degrades gracefully instead of breaking immediately at the trained context limit: it is an inference-time method in which the scale factor is updated dynamically on each forward pass based on the current sequence length.
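A minimal sketch of one common variant that combines Dynamic Scaling with the NTK-aware base adjustment above; names and signatures are illustrative:

```python
import torch

def dynamic_ntk_inv_freq(seq_len: int, orig_ctx_len: int, dim: int,
                         base: float = 10000.0) -> torch.Tensor:
    # Recomputed on every forward pass: s = max(1, current length / trained
    # length). Inside the trained window this reduces to exact, unscaled
    # RoPE; beyond it, the scale grows smoothly with the sequence length.
    scale = max(1.0, seq_len / orig_ctx_len)
    adjusted_base = base * scale ** (dim / (dim - 2))
    return 1.0 / adjusted_base ** (torch.arange(0, dim, 2).float() / dim)
```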
When implementing Dynamic Scaling with a KV cache, the RoPE embeddings must be handled carefully: the KV embeddings should be cached before applying RoPE, because the RoPE embedding of every token changes whenever s changes.
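A sketch of that caching order, assuming illustrative helper names; the interleaved channel pairing in `apply_rope` is one of several conventions used in practice:

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               inv_freq: torch.Tensor) -> torch.Tensor:
    # Rotate interleaved channel pairs of x by position-dependent angles.
    angles = positions[:, None].float() * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def decode_step(key_cache: torch.Tensor, k_new: torch.Tensor,
                inv_freq: torch.Tensor):
    # Append the *un-rotated* new key to the cache, then apply RoPE to the
    # whole cache with the inverse frequencies of the *current* scale s.
    # If keys were cached post-RoPE, older tokens would keep rotations
    # computed under a stale s.
    key_cache = torch.cat([key_cache, k_new], dim=1)  # (batch, seq, dim)
    positions = torch.arange(key_cache.shape[1])
    k_rotated = apply_rope(key_cache, positions, inv_freq)
    return key_cache, k_rotated
```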