No Positional Embedding
Removing RoPE (Rotary Positional Embedding), the mechanism designed to help models encode positional information, from some layers actually improves the model's understanding of global context.
Limitation of RoPE
When models are pretrained on relatively short contexts and then extended to longer ones, applying the same rotation patterns to distances never seen during initial training can introduce inaccurate positional signals. At distances beyond the training range, the frequency-based rotations repeat the same state at regular intervals, so different distances are incorrectly perceived as the same positional relationship. This distorts correlations between very distant tokens and hinders the model's ability to comprehensively understand long contexts. It's like completing several full rotations and ending up back at the starting position, which can be misread as not having moved at all.
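This aliasing can be seen with a toy single-frequency example (illustrative only, not any model's actual implementation). For one RoPE frequency `theta`, the relative rotation between two tokens is `distance * theta` modulo 2π, so distances that differ by a full period become indistinguishable:

```python
import math

def relative_rotation(distance, theta):
    """Angle (mod 2*pi) that a single RoPE frequency applies
    between two tokens `distance` positions apart."""
    return (distance * theta) % (2 * math.pi)

theta = 0.1                    # one frequency from the RoPE spectrum
period = 2 * math.pi / theta   # ~62.8 positions for this frequency

d_near = 10
d_far = d_near + round(period)  # a much longer distance, one full period away

# The two distances map to almost the same rotation angle, so the
# attention score cannot tell them apart: the "returned to start" effect.
print(relative_rotation(d_near, theta))
print(relative_rotation(d_far, theta))
```

In a real model there are many frequencies, but the lowest ones dominate at long range, so the effect persists for distances well beyond the pretraining window.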
NoPE
Recently released open-source models Gemma 3 and Exaone-4.0 share an interesting common feature: both support a 128K context window that most previous open-source models couldn't handle. These models alternate layers that use RoPE with layers that have no positional embedding at all, called NoPE layers. Gemma 3 uses a 5:1 ratio of RoPE to NoPE layers, while Exaone-4.0 uses a 3:1 ratio. According to their technical reports, removing positional encoding in certain layers improved the models' ability to generalize across global contexts.
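A hypothetical layer schedule matching the ratios above (the helper name and indexing convention are my own, not from either model's code) would mark every 6th layer as NoPE for a 5:1 model and every 4th layer for a 3:1 model:

```python
def uses_rope(layer_idx, rope_per_nope):
    """True if this 0-indexed layer applies RoPE; every
    (rope_per_nope + 1)-th layer is a NoPE layer instead."""
    return (layer_idx + 1) % (rope_per_nope + 1) != 0

# 5:1 schedule over 12 layers -> NoPE at layers 5 and 11 (0-indexed)
gemma_like = ["RoPE" if uses_rope(i, 5) else "NoPE" for i in range(12)]

# 3:1 schedule over 8 layers -> NoPE at layers 3 and 7
exaone_like = ["RoPE" if uses_rope(i, 3) else "NoPE" for i in range(8)]

print(gemma_like)
print(exaone_like)
```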
In NoPE layers, without explicit rotational position information, the model no longer incorporates distance-based information in token interactions. Instead, it calculates Q-K similarity based solely on the semantic information of each token. This allows the model to broadly reference tokens with high semantic relevance across long distances, giving it more flexibility to reconstruct context from a global perspective without the noise caused by distance distortion. While there is a loss in directly knowing relative order, the local positional information has already been sufficiently learned by the earlier RoPE layers, allowing NoPE layers to focus on processing context with a wider perspective.
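The contrast can be made concrete with a toy single-frequency, 2-D sketch (again illustrative, not real model code): a RoPE layer rotates q and k by their absolute positions before the dot product, so the score depends on relative distance, while a NoPE layer uses the raw dot product, so the score depends only on token content:

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Toy single-frequency RoPE: rotate a 2-D feature pair by pos * theta."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.0])  # identical content at every position
k = np.array([1.0, 0.0])

# NoPE: the same score no matter where the tokens sit in the sequence.
nope_near = q @ k
nope_far = q @ k

# RoPE: the score oscillates with the distance between the positions.
rope_near = rotate(q, 100) @ rotate(k, 99)  # distance 1  -> cos(0.1)
rope_far = rotate(q, 100) @ rotate(k, 70)   # distance 30 -> cos(3.0)

print(nope_near == nope_far)  # True: content-only similarity
print(rope_near, rope_far)    # differ: distance enters the score
```

This is why a NoPE layer can attend to a semantically relevant token 100K positions away just as easily as to one 10 positions away.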
Ultimately, alternating RoPE and NoPE layers creates a multi-scale strategy in which the model interprets context at a different scale at each stage. RoPE layers capture semantic relationships between relatively adjacent tokens, while NoPE layers analyze relationships across the entire range; alternating between the two enhances comprehensive understanding. From this perspective, NoPE goes beyond the empirical finding that "removing positional rotation improves performance" to a deliberate design choice: separating and then integrating contextual signals at different scales within the model. However, as context length grows, attention scores inevitably become more dispersed, gradually reducing accuracy over the full context. To address this, complementary techniques are used, such as making the softmax probability distribution more peaked when computing attention scores so the model can focus on the relevant context.
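One simple way to sketch that sharpening idea is temperature scaling of the attention logits before softmax (the temperature value below is hypothetical, chosen only to illustrate the effect, not taken from either model):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.5, 1.0, 0.5])  # toy attention scores

plain = softmax(logits)        # relatively flat over many tokens
sharp = softmax(logits / 0.5)  # temperature 0.5 -> more peaked distribution

# The same top token now receives a larger share of the probability mass,
# counteracting the dispersion that comes with very long contexts.
print(plain.max(), sharp.max())
```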