Interpretable by Construction: A Research Bet on SAE-like Expert Architectures
The Bet
You can build a language model architecture whose native decomposition is already close to what sparse autoencoder researchers are trying to recover post-hoc: a large pool of small, sparsely-activated, approximately-monosemantic units whose contributions to the residual stream are individually legible. If the bet pays off, we get interpretability as a structural property of the model rather than a reconstruction problem layered on top of it. If it fails, we learn something specific about why the SAE-style decomposition is harder to build in than to extract, which is itself worth knowing.
I’ve been working on this for a while now, building on the PEER (Parameter Efficient Expert Retrieval) and MONET (Mixture of Monosemantic Experts for Transformers) architectures. This post is a status report and a call for collaborators.
Aspiration
SAEs and sparse expert architectures are aimed at the same target from opposite directions. SAE research starts with a dense trained model and searches for a sparse, monosemantic decomposition of its activations. Expert architectures start with a sparse decomposition built into the weights and try to make the resulting model competitive. The interesting question is whether the second direction can reach the destination the first direction is aiming at — and at what training-efficiency cost.
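To make the post-hoc direction concrete, here is a minimal sketch of the kind of decomposition SAE research aims to recover: a top-k sparse autoencoder that re-expresses a dense activation as a handful of active features. All sizes, weights, and names here are illustrative, not taken from any particular SAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4  # illustrative sizes

# Toy SAE: encode a dense activation into a wide, sparse feature code,
# then decode back to a reconstruction of the activation.
W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(n_features)

def sae_topk(x, k=k):
    """Top-k SAE forward pass: keep only the k largest pre-activations."""
    pre = x @ W_enc
    code = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]           # indices of the k largest features
    code[idx] = np.maximum(pre[idx], 0)  # ReLU on the surviving features
    return code, code @ W_dec            # sparse code and reconstruction

x = rng.standard_normal(d_model)
code, x_hat = sae_topk(x)
print(int((code != 0).sum()))  # at most k active features
```

The expert-architecture direction tries to build this kind of sparse, per-feature decomposition into the weights from the start, rather than fitting it to a trained model's activations afterward.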
I want to be clear that my current architecture is not there yet. “Interpretable by construction” is the guiding vision, not a property I’ve demonstrated.
What the architecture currently gives me is:
A hierarchical routing mechanism (a mixture of expert pools, each containing a population of tiny, intended-to-be-monosemantic experts) that produces domain-level specialization without supervision. Expert pools cluster around code, biomedical text, academic citations, and so on.
Small, independently parameterized rank-1 experts, each implementing a function simple enough to characterize directly.
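The shape of this two-level scheme can be sketched in a few lines. This is a toy illustration of the general pattern (route to top pools, then to top rank-1 experts within each pool), not the actual architecture; all names, sizes, and the choice of ReLU are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pools, experts_per_pool, k_pools, k_experts = 16, 4, 8, 1, 2

# Pool-level router keys, plus per-pool rank-1 experts: expert (p, e)
# computes V[p, e] * relu(U[p, e] . x) -- one input direction and one
# output direction, so its function can be read off its two vectors.
pool_keys = rng.standard_normal((n_pools, d))
U = rng.standard_normal((n_pools, experts_per_pool, d))
V = rng.standard_normal((n_pools, experts_per_pool, d))

def forward(x):
    out = np.zeros(d)
    pools = np.argsort(pool_keys @ x)[-k_pools:]   # route to top pools
    for p in pools:
        scores = U[p] @ x                          # expert pre-activations
        for e in np.argsort(scores)[-k_experts:]:  # top experts in pool
            out += V[p, e] * max(scores[e], 0.0)   # rank-1 contribution
    return out

y = forward(rng.standard_normal(d))
print(y.shape)  # (16,)
```

Because each expert's contribution is a fixed output direction scaled by a scalar activation, the layer's output is a sparse sum of individually inspectable terms.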
Still To Do
What it does not yet give me, and what “SAE-like” would actually require:
Monosemanticity at the unit level
My goal is feature-level monosemanticity and functional legibility of individual experts: knowing what an expert tends to fire on is not equivalent to knowing what it computes.
Strong causal faithfulness
Topic correlations are the easy version of the claim. The harder version is that the expert’s learned function explains its behavioral contribution mechanistically.
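The structural half of the claim is easy to check by ablation: because expert contributions are additive, zeroing one expert changes the output by exactly that expert's own term. Here is a self-contained toy illustration of that check (a simplified layer with all experts active; names and sizes are hypothetical). The hard half, tying that term to downstream behavior, is what remains to be shown.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts = 16, 8

# Toy layer of rank-1 experts: expert e contributes V[e] * relu(U[e] . x).
U = rng.standard_normal((n_experts, d))
V = rng.standard_normal((n_experts, d))

def layer(x, ablate=None):
    """Sum of rank-1 expert contributions, optionally zeroing one expert."""
    out = np.zeros(d)
    for e in range(n_experts):
        if e == ablate:
            continue
        out += V[e] * max(U[e] @ x, 0.0)
    return out

x = rng.standard_normal(d)
e = 3
# Additivity means the ablation effect equals the expert's own term.
delta = layer(x) - layer(x, ablate=e)
expected = V[e] * max(U[e] @ x, 0.0)
print(bool(np.allclose(delta, expected)))  # True
```

This kind of ablation identity says nothing by itself about whether the expert's function mechanistically explains model behavior; it only establishes that the decomposition is causally additive at the layer level.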
Competitive performance at scale
My experiments so far have been sub-1B-parameter training runs of under 24 hours on one or two GPUs. The trends on these small prototypes look promising, but I won't have confidence that the approach scales to hundreds of billions of parameters until I see it work at the 8B scale or beyond.
So the project is best understood as a wager that architectural pressure toward sparsity and specialization can produce a model where the SAE-style decomposition is not only free, but fundamentally part of the causal mechanism. I have enough early evidence to think the bet is promising; I don't have enough to be confident it will work in full and at scale.
