
Massive Activations in DroPE: Evidence for Attention Reorganization

Published on January 18, 2026 3:05 PM GMT

Summary

I ran a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values", that is, concentrated large activations in the Query and Key tensors that prior work identifies as important for contextual understanding. I did this in my personal time, for fun.

Two main findings:

- DroPE significantly reduces massive-value concentration in Query tensors compared to RoPE.
- RoPE relies far more heavily on massive values than DroPE: disrupting them breaks RoPE but only degrades DroPE.

These findings suggest that, during recalibration, DroPE learns alternative attention mechanisms that don't depend on concentrated features.

Background

What Are Massive Values?

Massive values are unusually large activations in the Query (Q) and Key (K) tensors of transformer attention layers. Jin et al. (2025) identify them as:

- concentrated in low-frequency RoPE dimensions;
- present in Q and K but notably absent in V;
- critical for contextual knowledge understanding tasks (passkey retrieval, sentiment analysis, mathematical reasoning) but not for parametric knowledge retrieval (factual recall).

Jin et al. provide a mechanistic explanation rooted in RoPE's frequency structure. RoPE divides the head dimension into pairs, each rotating at a frequency θ_j = 10000^(−2j/d). High-frequency components (small j) change rapidly with position, encoding fine-grained positional information. Low-frequency components (large j) change slowly, and Jin et al. argue these dimensions primarily encode semantic content rather than position.
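To make this frequency structure concrete, here is a minimal sketch of the rotation speeds for a Llama-2-style head (head dimension 128, base 10000, 4096-token training context; these values are my assumptions about the setup, not something computed in the original experiments):

```python
import math

head_dim = 128
base = 10000.0
train_len = 4096

# Pair j rotates at frequency theta_j = base^(-2j / head_dim).
freqs = [base ** (-2 * j / head_dim) for j in range(head_dim // 2)]

# Full rotations accumulated across the training context: theta_j * L / (2*pi).
rotations = [f * train_len / (2 * math.pi) for f in freqs]

print(f"highest-frequency pair: {rotations[0]:.0f} full rotations")   # many
print(f"lowest-frequency pair:  {rotations[-1]:.3f} full rotations")  # < 1
```

The lowest-frequency pair never completes even one full rotation over the training context, which is the ϕ_m(C_train) < 2π situation discussed below.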
They find that disrupting these massive values devastates contextual understanding tasks, while parametric knowledge tasks show only mild degradation.

What Is DroPE?

DroPE (Gelberg et al., 2025) is a method that removes Rotary Position Embeddings (RoPE) from pretrained models and recalibrates them, which extends context length zero-shot.

The argument is roughly: RoPE scaling methods (PI, YaRN, NTK) attempt to extend context by compressing rotation frequencies. But low-frequency RoPE components never complete a full rotation during training (ϕ_m(C_train) < 2π for small ω_m). At extended lengths, these phases become out-of-distribution, so any scaling method must compress low frequencies by a factor 1/s to keep phases in range. But this compression shifts attention weights at long distances, exactly where semantic matching matters most.

These papers make seemingly incompatible claims:

- Jin et al.: RoPE -> massive values -> essential for contextual understanding.
- Gelberg et al.: removing RoPE -> better context extension with preserved capabilities.

If massive values are caused by RoPE and critical for understanding, how does DroPE maintain performance?

So we can check a neat, well-scoped research question: are massive values a cause or a consequence of contextual knowledge capabilities?
And the proxy test we can run cheaply: does DroPE, after recalibration, still have massive values?

Experiment 1: Massive Value Comparison

Methodology

Models compared:

- meta-llama/Llama-2-7b-hf (standard RoPE)
- SakanaAI/Llama-2-7b-hf-DroPE (RoPE removed + recalibrated)

Procedure:

1. Load both models with the identical tokenizer.
2. Process N diverse text samples.
3. Extract Q, K, V tensors from all 32 layers using forward hooks on the projection outputs.
4. Compute the L2-norm matrix M[head, dim] for each tensor.
5. Count positions where M > 5.0 × mean(M) (the definition of a massive value).
6. Repeat across samples and report mean ± std.

Text samples used: 10 texts including:

- Literary: Hobbit, Tale of Two Cities, Moby Dick excerpts
- Technical: ML/transformer descriptions
- Conversational: dialogue snippets
- Factual: scientific descriptions

Results

| Tensor | RoPE (mean ± std) | DroPE (mean ± std) | Change |
|--------|-------------------|--------------------|--------|
| Query  | 1475.5 ± 22.6     | 901.4 ± 36.0       | -38.9% |
| Key    | 1496.8 ± 69.8     | 1331.5 ± 74.1      | -11.0% |
| Value  | 174.0 ± 10.7      | 176.6 ± 5.7        | +1.5%  |

Figure 1: Massive value counts for Query, Key, and Value tensors. Error bars show ±1 standard deviation across 10 text samples. DroPE shows a 39% reduction in Query and an 11% reduction in Key.

We also plot this across layers.

Figure 2: Query massive values by layer. The shaded area shows the reduction from RoPE to DroPE. DroPE consistently has ~17 fewer massive values per layer.

Interpretation

How do we interpret these results? Query shows the largest reduction in massive values. Roughly, the Query tensor encodes "what to look for" in attention: the model's representation of what information the current position needs. DroPE models learn to distribute this information more evenly across dimensions rather than concentrating it in the low-frequency RoPE dimensions.

Key shows a moderate reduction in massive values. Roughly, the Key tensor encodes "what information is here" at each position.
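(As a concrete reference for the counting methodology: the massive-value count used in these comparisons can be sketched as below. This is a minimal sketch, not the repo's code; the hook plumbing that produces the stacked projection output `q` is omitted, and the toy tensor is purely illustrative.)

```python
import torch

def count_massive(x: torch.Tensor, lam: float = 5.0) -> int:
    """Count massive-value entries in a [num_heads, seq_len, head_dim] tensor.

    Following the procedure above: take the L2 norm over the sequence axis to
    get M[head, dim], then count entries where M > lam * mean(M).
    """
    M = x.norm(dim=1)  # L2 norm over positions -> [num_heads, head_dim]
    return int((M > lam * M.mean()).sum().item())

# Toy check: plant one dimension with huge activations and see it counted.
torch.manual_seed(0)
q = torch.randn(32, 512, 128)   # hypothetical hooked q_proj output, reshaped
q[0, :, 0] += 100.0             # planted massive dimension
print(count_massive(q))         # -> 1
```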
The smaller reduction suggests some concentration patterns persist, possibly because Key representations must still support some semantic matching.

Value is unchanged within error bars, which mostly confirms the Jin et al. finding.

Low variance across text types (std ~2-5% of mean) indicates this is a robust structural property of the models, not dependent on input content.

However, a closer look at Figure 2 shows DroPE didn't uniformly reduce massive values.

Figure 3: Layer 1 is the only layer where DroPE has MORE massive values than RoPE. This suggests DroPE concentrates some position-independent processing in the first layer.

I'm not sure how to interpret this. Possibly, without positional embeddings, DroPE uses layer 1 to establish token relationships through content alone, then relies less on concentrated features in subsequent layers.

Experiment 2: Disruption Experiment

Motivation

Finding 1 shows DroPE has fewer massive values, but are these values still functionally important? We test this by zeroing out massive-value dimensions and measuring model degradation.

Methodology

Procedure:

1. Identify massive-value dimensions in the Q and K projections (threshold λ = 5.0).
2. Register forward hooks that zero out these specific dimensions.
3. Measure perplexity on held-out text before and after disruption.
4. Compare to a control: zeroing the same number of random dimensions.
5. Repeat with 10 different random seeds for the control condition.

Disruption implementation:

```python
# Hook on q_proj output
def hook(module, input, output):
    # mask: boolean tensor where True = massive value dimension
    zero_mask = (~mask).to(output.dtype)  # 0 where massive, 1 elsewhere
    return output * zero_mask             # Zero out massive dimensions
```

Metric: M-R Difference = (Massive disruption PPL increase) − (Random disruption PPL increase). A higher M-R difference means the model relies more on massive values specifically.

Results

Raw perplexity values:

| Model | Baseline | Massive Zeroed | Random Zeroed |
|-------|----------|----------------|---------------|
| RoPE  | 1.30     | 1,508.5        | 1.31          |
| DroPE | 1.49     | 22.7           | 1.49          |

Percent increase (mean ± std across 10 seeds):

| Model | Massive Disruption | Random Disruption | M-R Difference |
|-------|--------------------|-------------------|----------------|
| RoPE  | +115,929% ± 0.0%   | +0.6% ± 0.7%      | +115,929%      |
| DroPE | +1,421% ± 0.0%     | +0.2% ± 1.2%      | +1,421%        |

Figure 4: Perplexity after disruption (log scale). Zeroing massive values breaks RoPE (PPL 1.3 -> 1508) but only degrades DroPE (PPL 1.5 -> 23). Random controls cause negligible damage.

Statistical validation (these tests are so cheap to run that there's no reason not to):

- Paired t-test (massive vs. random): p < 10⁻⁴⁸ for RoPE, p < 10⁻²⁹ for DroPE
- Independent t-test (RoPE vs. DroPE): p < 10⁻⁸⁷
- Cohen's d > 1000

So I feel fairly confident that these results are significant!

Key ratio: RoPE relies 82× more on massive values than DroPE.

Consistency across text types:

| Text Type  | RoPE PPL Increase | DroPE PPL Increase |
|------------|-------------------|--------------------|
| Literary   | +116,000%         | +1,400%            |
| Technical  | +115,800%         | +1,450%            |
| Repetitive | +116,100%         | +1,380%            |

Results are consistent regardless of text content.

Interpretation

RoPE model: Zeroing massive values completely breaks the model. It cannot function without these concentrated activations.

DroPE model: Zeroing massive values degrades but doesn't break the model.
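(As a quick arithmetic sanity check, the reported percent increases and the 82× ratio follow directly from the raw perplexities. The sketch below uses the rounded table values, so the ratio lands slightly under the reported 82×, which was presumably computed from unrounded numbers.)

```python
# Sanity check on the disruption table, using the rounded raw perplexities.
rope_base, rope_massive = 1.30, 1508.5
drope_base, drope_massive = 1.49, 22.7

rope_inc = (rope_massive - rope_base) / rope_base * 100    # percent PPL increase
drope_inc = (drope_massive - drope_base) / drope_base * 100

print(f"RoPE massive disruption:  +{rope_inc:,.0f}%")   # ~ +115,9xx%
print(f"DroPE massive disruption: +{drope_inc:,.0f}%")  # ~ +1,4xx%
print(f"reliance ratio: {rope_inc / drope_inc:.1f}x")   # ~ 81-82x
```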
The model has learned alternative mechanisms that partially compensate.

Control condition: Zeroing random dimensions causes negligible damage in both models, showing that massive values specifically are important, not just any high-norm dimensions.

Basically, both models have massive values, but RoPE is catastrophically dependent on them while DroPE is not.

Takes

The apparent contradiction between the papers dissolves once we distinguish where massive values come from versus how they're used:

1. Massive values are learned into the weights during RoPE training, not created by RoPE at inference. The projection matrices W_Q and W_K develop these concentration patterns because RoPE's frequency structure during training creates gradients that favor certain dimensions.

2. RoPE at inference makes massive values functionally critical. The rotation operation couples these concentrated features to position-dependent attention patterns. Remove the rotation, and the model breaks because it doesn't know how to use them without positional modulation.

3. DroPE recalibration teaches alternative usage patterns. During the brief recalibration phase, the model seems to learn to reduce concentration, distribute information more evenly across dimensions, and perform attention based on content similarity alone.

Why did I do this?

Understanding how and why large language models work in a principled way will require knowing the internal mechanisms of the transformer stack very deeply. While many components (such as attention, MLPs, and residual connections) are now relatively well studied, positional encoding remains surprisingly opaque (at least, to me). In particular, Rotary Positional Embeddings (RoPE) are both weird and brittle: they strongly shape attention behavior, impose hard-to-reason-about constraints on context length, and interact nontrivially with model scaling, quantization, and alignment.
I find RoPE wonky to work with: small changes in frequency scaling or context length can produce disproportionate failures, and extending context reliably seems like it will require delicate engineering. Also, I had a free hour, Claude Code, and had reread Yoav's paper at the gym earlier this morning.

Limitations and Future Work

Models tested: We examined only Llama-2-7B, as it was the largest DroPE model mentioned in the paper. (I also can't find the other models on HuggingFace.) Larger models and different architectures may show different patterns.

Recalibration dynamics: We compared endpoints (RoPE vs. fully recalibrated DroPE). Tracking massive values during recalibration would reveal how the redistribution occurs.

Task-specific analysis: We measured perplexity. Testing on Jin et al.'s contextual vs. parametric knowledge tasks would directly validate whether DroPE's reorganization preserves contextual understanding through alternative mechanisms. I'm doing this as we speak.

Reproducibility

Code

All experiments can be reproduced with:

```bash
# Massive value comparison
python scripts/run_massive_values_rigorous.py

# Disruption experiment
python scripts/run_disruption_rigorous.py
```

Hardware

- GPU: NVIDIA A10G (24GB)
- Models loaded in 4-bit quantization

Parameters

| Parameter              | Value      | Source                 |
|------------------------|------------|------------------------|
| λ (massive threshold)  | 5.0        | Jin et al. 2025        |
| Sequence length        | 512 tokens | Standard               |
| Number of text samples | 10         | Diverse corpus         |
| Number of random seeds | 10         | Statistical validation |

Citation

If you use these findings, please cite:

@article{jin2025massive,
  title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author={Jin, Mingyu and others},
  journal={ICML},
  year={2025}
}

@article{gelberg2025drope,
  title={Dropping Positional Embeddings for Zero-Shot Long-Context Extension},
  author={Gelberg, Tal and others},
  journal={arXiv preprint arXiv:2512.12167},
  year={2025}
}

@techreport{africa2026massive,
  title = {Massive Activations in DroPE: Evidence for Attention Reorganization},
  author = {Africa, David},
  year = {2026},
  url = {https://github.com/DavidDemitriAfrica/drope-activations}
}
