
Preliminary Results on Building Graphs from SAEs

TLDR:

- I use nodewise LASSO to build approximate sparse conditional dependence graphs over SAE features, with resampling and null controls.
- Initial experiments produce graphs with small standalone modules that are stable under resampling.
- These modules frequently correspond to coherent-looking linguistic features, and are only weakly aligned with cosine similarity.
- This should be read primarily as a proof of concept; methodological refinement is ongoing.

Motivation

In practice, SAE features exhibit redundancy, which we see in phenomena like feature splitting, absorption, and duplication. We can look at SAE features through the lens of cosine similarity of feature weights, or through correlation between activations. These capture some similarity between features, but don't model conditional dependence between them. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such dependency structure exists, we should be able to model it with methods that build a precision graph between features.

My approach models this conditional dependence structure downstream of activation correlations, with the goal of better understanding SAE feature geometry, and sits between circuits-style approaches and SAE work. To my knowledge, conditional dependence between SAE features has not been modeled in prior work.

Methodology

Run nodewise LASSO on random activations to approximate linear dependence between SAE features, using repeat and null trials to control against dataset bias.

I've built a pipeline to test this theory.
My approach is as follows:

1. Sample activations from an SAE at a given layer on random sequences.
2. Pre-screen candidate neighbors for each SAE feature:
   - Calculate correlations between features over activations, convert to Fisher z-scores, and use BH-FDR to control false discovery.
   - Keep the top $k$ candidates per node.
3. For each SAE feature:
   - Perform a parameter sweep to tune the LASSO regularization coefficient.
   - Run LASSO to identify conditionally dependent neighbors for this feature.
4. Merge edges from each nodewise sweep, keeping bidirectional edges only.
5. Repeat under random sampling to test stability: I do 30 resamples plus a matched null trial (feature-wise shuffle) for each resample.
6. Identify edges that appear consistently across resamples and pass BH-FDR against the null trials.

This procedure approximates a conditional-dependence graph, but it is not an exact estimator.

Initial Results

Small-scale trials identify linguistically coherent-looking modules that are only weakly aligned with cosine similarity.

I've successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b. Here, I'm presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I've tested so far. Sequences are randomly sampled from fineweb.

- Base Model: gemma2:27b
- SAE Model: gemmascope, layer_10/width_131k
- Dataset: fineweb
- Activation samples per feature: 200
- Activation Sparsity (30-trial mean): 0.99897
- Average features per token (30-trial mean): 133.7

For downstream analysis, I keep edges from the repeat trials using BH-FDR to compare the stability of each edge relative to the null trials, with 0.1 as the threshold.
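As a minimal sketch of the core loop, the nodewise LASSO step, the bidirectional-edge merge, and a simple null-based FDR estimate might look like the following. The function names and the fixed `alpha` are illustrative assumptions; the actual pipeline sweeps the regularization coefficient per node and uses BH-FDR with matched null trials rather than this raw null/observed ratio.

```python
# Illustrative sketch only: helper names and the fixed alpha are my own;
# the real pipeline sweeps the LASSO coefficient and uses BH-FDR.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_neighbors(X, node, candidates, alpha=0.1):
    """Regress one feature on its pre-screened candidates; features with
    nonzero LASSO coefficients are its conditionally dependent neighbors."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(X[:, candidates], X[:, node])
    return [c for c, w in zip(candidates, model.coef_) if abs(w) > 1e-8]

def symmetric_edges(neighbor_sets):
    """Merge nodewise results, keeping bidirectional edges only."""
    edges = set()
    for i, nbrs in neighbor_sets.items():
        for j in nbrs:
            if i in neighbor_sets.get(j, ()):
                edges.add((min(i, j), max(i, j)))
    return edges

def estimated_fdr(observed, expected_null):
    """Per repeat-count threshold: expected null edges / observed edges."""
    return {k: expected_null[k] / observed[k] if observed[k] else 0.0
            for k in observed}
```

For instance, the 2-repeat row of the results table below works out to 1,981 / 57,047 ≈ 3.47e-02 under this ratio.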
I find that this model + dataset pair meets the FDR threshold of 0.1 at ≥2 replicates, with 57,047 retained edges.

| Number of repeats an edge is seen in | Observed edges | Expected null edges | Estimated FDR |
|---|---|---|---|
| 1 | 204,609 | 196,273 | 9.59e-01 |
| 2 | 57,047 | 1,981 | 3.47e-02 |
| 3 | 31,429 | 110 | 3.50e-03 |
| 5 | 15,165 | 1 | 6.59e-05 |
| 10 | 5,286 | 0 | 0 |
| 20 | 1,389 | 0 | 0 |
| 30 | 302 | 0 | 0 |

Note that for this preliminary experiment I'm using only a single null resampling per trial.

Importantly, I see that cosine similarity and edge strength in the discovered graph are weakly correlated with high variance (measured over the set of stable edges). For qualitative interpretation, I restrict manual inspection to the much smaller subset of stable edges on a small subset of nodes. Within this set of 227 nodes and 302 edges, I end up with one large connected component (98 nodes), a handful of smaller standalone components (5–9 nodes each), and a large number of tiny components. The tiny components appear to mostly be duplicated features with strong cosine similarity, so I've filtered them out.

Small components

The smaller standalone components look like linguistically coherent features under manual inspection (mostly grammar- and context-related):

- Component 2 (9 nodes): possessive pronouns + some context.
- Component 3 (9 nodes): "to be" + some descriptive context.
- Component 4 (7 nodes): "which/who/that" followed by verbs, plus explanatory context (e.g. we prove that, that maps, that means).
- Component 5 (5 nodes): "has been", either alone or followed by specific words.
- Component 6 (5 nodes): negations (not, didn't) + verb and article context.

Large component

The large connected component contains communities that can be identified with community detection.
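Given the stable edges as index pairs, the component split and community-detection step just described can be sketched with networkx. Greedy modularity (CNM) is my assumed algorithm choice here; the post does not specify which community-detection method is used.

```python
# Sketch under assumptions: stable edges given as feature-index pairs;
# greedy modularity stands in for the unspecified community-detection
# algorithm actually used in the pipeline.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def components_and_communities(stable_edges):
    """Split the stable-edge graph into connected components (largest
    first), then detect communities inside the largest component."""
    G = nx.Graph()
    G.add_edges_from(stable_edges)
    components = sorted(nx.connected_components(G), key=len, reverse=True)
    communities = greedy_modularity_communities(G.subgraph(components[0]))
    return components, [set(c) for c in communities]
```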
Note that for the sake of legibility I don't label nodes in the graph of all clusters. My current read is that these communities are not as linguistically "clean" as the standalone components: many have a node that doesn't fit the rest of the "theme" of the community. For example:

- Cluster 8: proper names / place names / labels.
- Cluster 4: apostrophes, numbers/dates, citation/legal-style tokens.
- Cluster 7: "the + noun/category" phrase patterns.
- Clusters 2/3: connective / clause / punctuation-ish syntax groupings, but messy.

I also see a set of catch-all miscellaneous communities as well as tiny clusters. The catch-all communities (Clusters 0 and 6) cover a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names. Many of the small communities are incoherent, but some show interesting structure. A notable example is a community that describes uses of the number '2' as a prefix in different contexts.

Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure towards conditional dependence, rather than as a standalone finding separate from correlation.

Caveats and Limitations

There are several limitations to this approach to keep in mind:

- This isn't a true precision graph, because SAE activations are not Gaussian.
- The output graph depends on correlations from the pre-screening step, so this method should be read as sparsifying the correlation structure of SAE activations in the direction of linear dependence.
- What I'm calling "stable" here means stability under resampling over the same dataset, which doesn't say anything about underlying feature faithfulness.
At the current FDR threshold, stability is an extreme constraint (302 edges over 131k nodes), and the standalone clusters seem to mostly be linguistic backbone. As implemented right now, this method also has hyperparameters that need either more thorough sweeps or methodological changes to remove them.

Next Steps

Right now, I see two primary directions this project needs to take:

1. Refining some of the steps in the pipeline to (a) confidently use the full ~57k-edge repeat set and (b) remove some of the current design's hyperparameter dependence.
2. Scaling this to all layers in a given model and figuring out how to connect graphs across layers, accounting for the residual stream.

Additionally, I'll also be looking at:

- How different datasets may change the output graphs, and how to control against that.
- Where this might land in relation to feature splitting and absorption.
- What, if any, hierarchical structure between features we might be able to infer from these graphs.

Funding Note: This work is funded by a Coefficient Giving TAIS grant.

