Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M

Published on February 23, 2026 8:07 PM GMT

In my previous post, I found that abstract and concrete prompts activate almost entirely separate feature clusters in Gemma 3 270M’s sparse autoencoder decomposition. The abstract side looked like reasoning operations (qualification, problems/issues); the concrete side looked like physical domain ontologies (composition, geology). The Jaccard similarity between their active feature sets was just 0.102 at layer 15, and this separation was constructed progressively through layer depth.

The most important limitation raised was that all prompts shared a similar hand-crafted structure. A natural question: is this separation a property of the concepts themselves, or of my prompt templates? This post attempts to answer that.

I’d like to flag that this is independent work and I welcome feedback. A community suggestion on my last post directly motivated these experiments: I was asked about prompt frame sensitivity, single-word prompts, and corpus-extracted sentences. I’ve now tested all three.

Before jumping in, I’d like to highlight what this axis looks like in feature space, as I’m a sucker for a visual. The tight abstract cluster vs. the dispersed concrete features previews the main finding of this post: the axis is real, but asymmetric. The rest of this post tests whether this structure holds up under different conditions:

Figure 1: The axis in feature space

As for setup, this is the same model and tooling as the previous post: Gemma 3 270M (pretrained, not instruction-tuned), Gemma Scope 2 16k SAEs at layer 15. I kept the same 10 abstract and 10 concrete content words from the original study. The abstract words are things like immigration policy, censorship, governance, privacy, inequality; the concrete words are things like limestone, volcano, blood, granite, atmosphere.
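For concreteness, the core metric carried over from the previous post, Jaccard similarity between the sets of features each side activates, can be sketched as follows. The activation arrays here are synthetic stand-ins (in the real pipeline they would be Gemma Scope SAE activations at layer 15), and the function names are my own:

```python
import numpy as np

# Synthetic stand-ins for SAE activations: rows = prompts, cols = 16,384 features.
# Real values would come from the Gemma Scope 2 16k SAE at layer 15.
rng = np.random.default_rng(0)
abstract_acts = rng.random((10, 16384)) * (rng.random((10, 16384)) < 0.01)
concrete_acts = rng.random((10, 16384)) * (rng.random((10, 16384)) < 0.01)

def active_set(acts: np.ndarray) -> set[int]:
    """Indices of features with activation > 0 on at least one prompt."""
    return set(np.flatnonzero(acts.max(axis=0) > 0))

def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two active-feature sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Low values mean strong separation between the two categories.
separation = jaccard(active_set(abstract_acts), active_set(concrete_acts))
```

This is the statistic behind every bracketed Jaccard number in this post; only the activation arrays change between experiments.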
The threshold for “active” is still > 0, and a feature counts as “biased” toward one category if its mean activation there is at least 2x the mean on the other side, with a minimum mean activation of 50.

Experiment 1: Prompt Frame Variation

The idea here was simple: take the same 20 content words and present each in 5 different prompt frames. The original study used category-specific templates (“A nuance of the debate over immigration policy is” for abstract, “One of the components of limestone is” for concrete). The new frames strip away that scaffolding to varying degrees:

Original: category-specific templates from the published study
Copula: “Limestone is” / “Immigration policy is”
Imperative: “Describe limestone:”
About: “Tell me about limestone”
Concept: “The concept of limestone”

That gives 10 words × 2 categories × 5 frames = 100 prompts total.

The axis persists across all five frames. The original template shows the strongest separation (Jaccard 0.059), while the more generic frames cluster around 0.17–0.24; the mean across all frames was 0.169. None of the bootstrap confidence intervals come anywhere near the 0.28–0.29 range I saw for the weak axes (positive/negative, animate/inanimate) in the previous post.

Figure 2: Separation of the axis using different prompts

This tells me that the original prompt design amplified the separation but didn’t create it. The category-specific frames acted as a tighter filter, activating fewer but more category-distinctive features (the original frame produced only 90 abstract-active features vs. 244–288 for the generic frames). Less scaffolding means more features fire overall, increasing the overlap, but the core separation remains.

The more surprising finding to me was the asymmetry between abstract and concrete. I computed which features were consistently biased across frames: a feature counts as “robust” if it’s biased in 4 or more of the 5 frames. The concrete side has 57 robust features, 33 of them biased in all 5 frames.
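The bias and robustness criteria can be sketched the same way. This is my reading of the thresholds above (a 2x mean-activation ratio with a minimum mean of 50, and biased in at least 4 of the 5 frames); the function names and toy numbers are hypothetical:

```python
import numpy as np

def biased(mean_side: np.ndarray, mean_other: np.ndarray,
           ratio: float = 2.0, floor: float = 50.0) -> np.ndarray:
    """Mask of features whose mean activation on one side is at least
    `ratio` times the other side's, with a minimum mean of `floor`."""
    return (mean_side >= ratio * mean_other) & (mean_side >= floor)

def frame_robust(per_frame_masks: list, min_frames: int = 4) -> np.ndarray:
    """Mask of features biased in at least `min_frames` of the frames."""
    return np.sum(per_frame_masks, axis=0) >= min_frames

# Toy example: 3 features across 5 frames. Feature 0 clears both thresholds
# in every frame, feature 1 only in the first two frames, feature 2 never
# clears the activation floor.
masks = [biased(np.array([120.0, 80.0 if f < 2 else 30.0, 40.0]),
                np.array([10.0, 10.0, 5.0]))
         for f in range(5)]
robust = frame_robust(masks)  # -> [True, False, False]
```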
The abstract side has only 18 robust features, with just 4 biased in all 5 frames.

Figure 3: Feature robustness across frames

The concrete side’s robustness profile is flatter: its features are more intrinsically tied to the words. The abstract side drops off steeply, meaning most abstract-biased features are frame-dependent. Concrete concepts like “limestone” activate similar features regardless of how you frame the sentence; abstract concepts like “immigration policy” activate different feature neighborhoods depending on the surrounding language.

I looked up the robust features on Neuronpedia, and the picture is mixed. About 70% of the robust concrete features correspond to physical/perceptual domains: f11 (biological classification), f76 (blood), f317 (astronomy), f413 (woodwork/architecture), f437 (mineralogy). But the remaining 30% appear to be coincidental correlations with my word set; f83 fires for “the letter m,” likely because several concrete words contain m. On the abstract side, I see a similar mix: f200 (problems/issues) and f380 (comparison/relations) from the original study persist, but f7 (AI/technology) and f93 (gender/culture) seem more topical than ontological.

One thing I found interesting about the concrete features is that they’re broader than just geology/chemistry, which is what the original post suggested. Colors, abrasions, woodwork, and astronomy were all persistent, and these are all “things you can see/touch/measure.” The concrete cluster might be better described as “perceptible physical world” rather than “physical science.”

Experiment 2: Single-Token Bare Concepts

This was the most extreme test: run each content word alone with zero syntactic context, just “immigration” or “limestone” as the entire input. If the axis appears here, it’s fundamentally about the concept, not the scaffolding around it.

The single-token Jaccard was 0.154 [0.113, 0.170].
This is close to the range of the framed prompts (0.169–0.236), a positive signal: bare single tokens show separation comparable to framed prompts, so the axis doesn’t collapse without context.

However, the feature-level bias correlation between the single-token and full-prompt conditions was weak (r = 0.211). The model separates abstract from concrete in both cases, but it’s not necessarily using the same features to do it. The axis behaves more like a property of the category than a fixed set of feature detectors.

Figure 4: Full prompt vs. single token feature bias

One practical note: 10 of the 20 words tokenize to multiple tokens (e.g., “limestone” → “lim” + “estone”). I initially expected this to degrade the signal, but it went the other way: multi-token words showed higher bias correlation (r = 0.384) than single-token words (r = 0.080). The single-token words include polysemous items like “justice,” “bridge,” and “trunk”; multi-token words like “bioethics” and “limestone” are more semantically specific, so their features transfer more consistently between the bare and framed conditions.

Experiment 3: Corpus-Extracted Sentences

The first two experiments varied the prompt structure but still used researcher-designed text. For ecological validity, I pulled ~280 real sentences from Wikipedia containing my 20 target words. One key methodological difference: I extracted features at the position of the target word in each sentence, not at the last token. This tests whether the model’s representation of “limestone” in the middle of a real Wikipedia sentence still carries the concrete-physical signature.

The corpus Jaccard was 0.325 [0.264, 0.315], notably higher than in the controlled experiments, meaning the separation is weaker in naturalistic text. This was expected: Wikipedia sentences have varied syntactic structures, the target word appears in different positions and contexts, and each sentence brings its own feature noise.
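The bracketed intervals here and throughout are bootstrap confidence intervals. A minimal sketch of how one can be computed for the Jaccard statistic, resampling prompts with replacement (synthetic arrays, hypothetical names); it also shows one source of the downward bias discussed under Limitations, since resampling duplicates prompts and slightly shrinks each side’s active-feature union:

```python
import numpy as np

def active_jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard between the sets of features active (> 0) in each group."""
    sa = set(np.flatnonzero(a.max(axis=0) > 0))
    sb = set(np.flatnonzero(b.max(axis=0) > 0))
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def bootstrap_ci(a: np.ndarray, b: np.ndarray, n_boot: int = 1000,
                 seed: int = 0) -> tuple:
    """Percentile bootstrap CI: resample prompts (rows) with replacement on
    each side and recompute the Jaccard. Because resampling duplicates some
    prompts, the resampled active-feature sets tend to be slightly smaller
    than the originals, biasing the interval downward."""
    rng = np.random.default_rng(seed)
    stats = [
        active_jaccard(a[rng.integers(0, len(a), len(a))],
                       b[rng.integers(0, len(b), len(b))])
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(lo), float(hi)
```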
But 0.325 is still well below 0.5, so the axis is there, just noisier.

The strongest result of the entire robustness battery: 100% of the frame-robust features from Experiment 1 were also active in the Wikipedia sentences. All 18 abstract-robust and all 57 concrete-robust features persisted. These features are not just artifacts of my prompt templates; they fire in real text too. The higher Jaccard in the corpus condition reflects additional features activated by diverse contexts, not the absence of the axis signal.

Figure 5: Axis separation across experiments

Takeaways

Based on these experiments and my prior work, the abstract/concrete axis is real. It survives five different prompt templates, bare single-word inputs, and naturalistic Wikipedia text. The separation gets noisier as context control decreases (from 0.059 under the original tightly controlled prompts to 0.325 on the corpus), but it never disappears, and the robust features persist at 100% into corpus data.

But the axis is not symmetric. Concrete concepts have 3× more frame-robust features and more stable representations across conditions. Abstract concepts are more contextually constructed: the model needs surrounding language to fully activate abstract-domain features. I think this may reflect a difference in how language models organize physical vs. conceptual knowledge. Physical objects have relatively fixed properties regardless of context; abstract ideas are inherently relational and context-dependent. The model’s representations seem to mirror this.

I also want to be honest about what the Neuronpedia lookups showed: roughly 30% of the robust features don’t seem to be semantic markers; they’re correlations with surface-level properties of my word set. Frame variation filters out frame-dependent features, but it can’t filter out features that correlate with the words themselves.
This is a distinct confound that the corpus experiment only partially addresses: the features still fire in Wikipedia, but I haven’t confirmed they’re still biased in the same direction in corpus data.

One finding from Experiment 2 that surprised me: the polysemy effect. Words with specific meanings maintain their bias direction between the single-token and framed conditions; words with multiple meanings don’t. This suggests something about how the model resolves word sense: syntactic context may be disambiguating which sense of the word the features should track. This feels like a thread worth pulling on.

Limitations

The recurring limitations from my prior work carry forward: n=10 per category, a single model (Gemma 3 270M), a single SAE width (16k). Additionally:

The ~30% artifact rate among robust features means the “clean” robust feature set is more like 12–13 abstract and 40 concrete features. Still meaningful, but smaller than the headline numbers suggest.

The corpus experiment extracts at the target word position, a different methodology from the prior work (last-token position). These aren’t perfectly comparable.

Bootstrap CIs for the corpus Jaccard show a slight downward bias from resampling-induced diversity reduction; the point estimate falls slightly above the upper CI bound. I’ve noted this but haven’t corrected for it, and the conclusions don’t change.

I still don’t have causal validation. Everything here is descriptive: I’m showing that certain features correlate with semantic categories, not that they causally implement the distinction. Feature ablation is still on the roadmap.

Next Steps

With the prompt confound now partially addressed, I think the most impactful next steps are:

Causal validation via feature ablation: ablate the robust features and see if the model’s behavior changes in ways consistent with the axis.
If ablating concrete-robust features shifts the model’s predictions on concrete prompts toward abstract-like behavior, that’s causal evidence.

Cross-architecture replication: test whether the same axis and asymmetry appear in Pythia models. The training checkpoints would let me study when the axis emerges during training.

Larger prompt sets with controlled confounds: design word sets that control for the surface-level correlations that produced the ~30% artifact features. This probably means balancing word length, frequency, and initial letter across categories.

I’m eager for feedback. If you see issues with the methodology or have suggestions for extensions, please let me know. Thanks to Nathan Helm-Burger for suggesting this line of research.