
Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]

This is my project proposal for Pivotal. Apply as a mentee by May 3rd.

The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives to train tensor-transformers (see an early example in the last section).

This allows us to study fundamental problems (suppression & error correction, compositionality/circuits, dev-interp, etc) with the odds stacked in our favor:

- We know the data-generating process (DGP) – what the bigram statistics, skip-trigrams, and induction patterns are, and how they interfere with each other.
- Tensor-transformers make compositionality clear as day (i.e. you can find relationships between any model components solely from the weights, whereas normal NNs require running data).
- This is a transformer on a language task – results learned here straightforwardly apply to real LLMs.
- Modifiable complexity – we can change the complexity of the data, the number of layers, the width of the model, etc (in general, we can easily train a bespoke model to verify a target hypothesis).

Specific Research Directions

- Improve the DGP (data-generating process) – extend the DGP to include computational patterns beyond n-grams (e.g. nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions, etc) learned from existing datasets.
- Interp-across-time – are there any dependent structures during training (e.g. must learn X before learning Y)? (Most similar to work by Naomi Saphra.)
- Building interp tools – what techniques (existing or novel) can be used to find these ground-truth features?
- Phenomenon studies – use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
- Tensor interp – because we’re using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction).

High Level View

I’m shooting for a healthy feedback loop of:

1. Use the existing computational vocab (e.g. induction) to make a toy LLM
2. Use (1) to improve our basic knowledge of models (e.g. suppression) and learn new computational vocab
3. Repeat…
4. Profit

If we succeed at enough loops of this process, it could work as a foundation for LLMs automating ambitious mech interp. In a sense, mech interp is already a verifiable task (i.e. find *simple* descriptions that replicate model behavior), but we need to resolve enough of our own confusions (& build better tools) first.

If this interests you, do apply to my (& Thomas’) research stream (by May 3rd).

Current Trained Model

As an example, I’ve trained a 2-layer attention-only model. Looking at embed -> unembed, there’s lots of apparent structure. Zooming into the Verb_T/NOUN square, you can see the bigram statistics for:

- alice → sees (70%), helps (20%), finds (10%)
- bob → knows (70%), likes (20%), meets (10%)
- carol → calls (70%), tells (20%), sees (10%)
- etc.

We can also look at a slice of the QK circuit. For skip-bigrams (4 rules, max_skip=8):

- beach … big → at
- garden … old → and
- lake … new → or
- office … small → to

Zooming in, you can clearly see two of these rules. The other two sit in the top-left & top-right boxes – they’re negative, yes, but this is bilinear attention: it’s actually a negative QK times a negative OV, which ends up positive.
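To make the “solely from the weights” point concrete, here is a minimal sketch of how these circuit slices can be read off an attention-only model. The weight names (W_E, W_U, W_Q, W_K) and shapes are placeholders standing in for the trained toy model’s parameters, filled with random stand-ins so the snippet runs on its own; it is not the exact analysis code behind the plots above.

```python
# Minimal sketch: reading "effective circuits" directly off the weights of an
# attention-only transformer, with no forward passes. All weights below are
# random stand-ins; in practice you would load them from the trained model.
import numpy as np

d_vocab, d_model, d_head = 50, 32, 8
rng = np.random.default_rng(0)

W_E = rng.normal(size=(d_vocab, d_model))   # token embedding
W_U = rng.normal(size=(d_model, d_vocab))   # unembedding
W_Q = rng.normal(size=(d_model, d_head))    # query projection (one head)
W_K = rng.normal(size=(d_model, d_head))    # key projection (one head)

# Direct path embed -> unembed: entry (i, j) is the direct-path logit
# contribution of input token i to output token j. In the toy model this is
# where the bigram statistics (alice -> sees/helps/finds, ...) show up.
direct_circuit = W_E @ W_U            # (d_vocab, d_vocab)

# QK circuit for one head: entry (i, j) says how strongly a query token i
# attends to a key token j, up to the softmax. Skip-bigram rules show up as
# standout entries for pairs like (big, beach) or (old, garden).
qk_circuit = W_E @ W_Q @ W_K.T @ W_E.T  # (d_vocab, d_vocab)

print(direct_circuit.shape, qk_circuit.shape)
```

The plots in this section are essentially heatmaps of (sub-blocks of) these two matrices, restricted to the relevant vocabulary classes (e.g. the Verb_T/NOUN square of the direct circuit).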
