
Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]

This is my project proposal for Pivotal. Apply as a mentee by May 3rd.

The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives to train tensor-transformers (see an early example in the last section).

This allows us to study fundamental problems (suppression & error correction, compositionality/circuits, dev-interp, etc) with the odds stacked in our favor:

- We know the data-generating process (DGP) – what the bigram statistics, skip-trigrams, and induction patterns are, and how they interfere with each other.
- Tensor-transformers make compositionality clear as day (i.e. you can find relationships between any model components solely from the weights, whereas normal NNs require running data).
- This is a transformer on a language task – results learned here straightforwardly apply to real LLMs.
- Modifiable complexity – we can change the complexity of the data, the number of layers, the width of the model, etc (in general, we can easily train a bespoke model to verify a target hypothesis).

Specific Research Directions

- Improve the DGP (data-generating process) – extend the DGP to include computational patterns beyond n-grams (e.g. nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions, etc) learned from existing datasets.
- Interp-across-time – are there any dependent structures during training (e.g. must learn X before learning Y)? (Most similar to work by Naomi Saphra.)
- Building interp tools – what techniques (existing or novel) can be used to find these ground-truth features?
- Phenomenon studies – use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
- Tensor interp – because we’re using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction).

High Level View

I’m shooting for a healthy feedback loop of:

1. Use the existing computational vocab (e.g. induction) to make a toy LLM
2. Use (1) to improve our basic knowledge of models (e.g. suppression) and learn new computational vocab
3. Repeat…
4. Profit

If we succeed at enough loops of this process, it could work as a foundation for LLMs automating ambitious mech interp. In a sense, mech interp is already a verifiable task (i.e. find *simple* descriptions that replicate model behavior), but we need to resolve enough of our own confusions (& build better tools) first.

If this interests you, do apply to my (& Thomas’) research stream (by May 3rd).

Current Trained Model

As an example, I’ve trained a 2-layer attention-only model. Looking at embed -> unembed, there’s lots of apparent structure. Zooming into the Verb_T/NOUN square, you can see the bigram statistics for:

- alice → sees (70%), helps (20%), finds (10%)
- bob → knows (70%), likes (20%), meets (10%)
- carol → calls (70%), tells (20%), sees (10%)
- etc.

We can also look at a slice of the QK circuit. For skip-bigrams (4 rules, max_skip=8):

- beach … big → at
- garden … old → and
- lake … new → or
- office … small → to

Zooming in, you can clearly see two of these rules. The other two sit in the top-left & top-right boxes – they’re negative, yes, but this is bilinear attention: it’s actually a negative QK times a negative OV, which ends up positive.
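To make the “solely from the weights” point concrete, here is a minimal sketch of how these circuit slices can be read off an attention-only model. The weight names (W_E, W_U, W_Q, W_K) and shapes are placeholders standing in for the trained toy model’s parameters, filled with random stand-ins so the snippet runs on its own; it is not the exact analysis code behind the plots above.

```python
# Minimal sketch: reading "effective circuits" directly off the weights of an
# attention-only transformer, with no forward passes. All weights below are
# random stand-ins; in practice you would load them from the trained model.
import numpy as np

d_vocab, d_model, d_head = 50, 32, 8
rng = np.random.default_rng(0)

W_E = rng.normal(size=(d_vocab, d_model))   # token embedding
W_U = rng.normal(size=(d_model, d_vocab))   # unembedding
W_Q = rng.normal(size=(d_model, d_head))    # query projection (one head)
W_K = rng.normal(size=(d_model, d_head))    # key projection (one head)

# Direct path embed -> unembed: entry (i, j) is the direct-path logit
# contribution of input token i to output token j. In the toy model this is
# where the bigram statistics (alice -> sees/helps/finds, ...) show up.
direct_circuit = W_E @ W_U            # (d_vocab, d_vocab)

# QK circuit for one head: entry (i, j) says how strongly a query token i
# attends to a key token j, up to the softmax. Skip-bigram rules show up as
# standout entries for pairs like (big, beach) or (old, garden).
qk_circuit = W_E @ W_Q @ W_K.T @ W_E.T  # (d_vocab, d_vocab)

print(direct_circuit.shape, qk_circuit.shape)
```

The plots in this section are essentially heatmaps of (sub-blocks of) these two matrices, restricted to the relevant vocabulary classes (e.g. the Verb_T/NOUN square of the direct circuit).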
