Intro: Non-Identifiability of Explanations
This is the introduction to the Which Circuit is it? sequence. We will develop some basic concepts and introduce the non-identifiability problem. This project is done in collaboration with Groundless.

Recent work[1] has shown that, under commonly used criteria for Mechanistic Interpretability (MI), neural network explanations are not unique, not even in toy models like tiny Multilayer Perceptrons (MLPs). This is referred to as the non-identifiability of explanations.

Why is non-identifiability a problem for Mechanistic Interpretability? The literature[2] points out three ways in which non-identifiability makes Mechanistic Interpretability fragile:

- Poor Generalization
- Sensitivity to Design Choices
- Increased Risk of False Discoveries

Throughout this sequence, we will attempt to elucidate what each failure mode entails. In this first post, we take a step back to focus on foundational concepts, setting the stage for the experiments and reflections we will conduct in subsequent posts. Let's start with some basic questions.

What is an Explanation?

Roughly[3], an explanation is a process that generates predictions from a setting to account for outcomes.

Example: Physics Model

Let's imagine we drop a ball from the top of a building and see it bounce. We are interested in explaining why it bounced instead of shattering.

Setting: This encompasses the initial conditions of our physical system, such as the building's height, the materials of the ball and the ground, and the initial positions and velocities. In our example, the setting is the input variable s to our physics model: the system state at the initial time t₀.

Outcomes: This is the resulting phenomenon of interest, the end state of our physical system. At a later time t₁, we see the ball bouncing instead of shattering, so our physical model should output a matching prediction.

Initially, without having a full explanation, we have a correspondence between a particular setting and an outcome. Our observations occur at two different instances (t₀ and t₁).
Even though many physical laws are time-symmetric, our human experience only moves forward in time. In a sense, a prerequisite for a physics explanation is a selection (and temporal ordering) of events.

The locality of an explanation is connected to how we consider these correspondences. For now, we just highlight that our observations are specific. If we had multiple trials, it would be incorrect to blindly associate the setting of one trial with the outcome of another.

Nowadays, we are convinced physical laws are constant. However, that is not an a priori fact. Multiple explanations could account for how the setting leads to the observed outcome. Let's consider a simple Newtonian physics model that ignores air resistance.

Process: We can evaluate our setting by plugging it into our chosen mathematical model, a simple physics model without air drag.

Prediction: The equations of our model allow us to compute a prediction. In this example, the explanatory process is akin to running a simulation. Our chosen physical model is deterministic, non-chaotic, analytically tractable (with a closed-form solution), and computationally low-cost. It does not account for air drag, so we (wrongly) predict that the ball will shatter.

Lastly, our observed outcomes and predictions can differ, like they do in this example. Our chosen physical model has low explainability[4], as it does not properly account for our observed outcome: we ignored air drag and got an inaccurate prediction. A better explanation would have predictions that actually match observed outcomes. By comparing the observed outcome with our prediction, we can reason about the accuracy, precision, uncertainty, and incompleteness of our explanation.
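The drag-free process above can be sketched in a few lines of code. This is a minimal illustration, not the post's actual model: the shatter threshold is a made-up number, and the closed-form impact speed comes from energy conservation without air resistance.

```python
import math

G = 9.81              # gravitational acceleration, m/s^2
SHATTER_SPEED = 10.0  # hypothetical impact speed (m/s) above which we predict shattering

def predict(height_m: float, initial_velocity_m_s: float = 0.0) -> str:
    """Setting -> prediction: plug the initial conditions into the closed-form model."""
    # Without air drag, energy conservation gives v^2 = v0^2 + 2*g*h.
    impact_speed = math.sqrt(initial_velocity_m_s**2 + 2 * G * height_m)
    return "shatter" if impact_speed > SHATTER_SPEED else "bounce"

# Dropping the ball from a 20 m building: ignoring drag overestimates the
# impact speed, so the model (wrongly) predicts shattering.
print(predict(20.0))  # -> "shatter", while the observed outcome was a bounce
```

Running the explanatory process really is just evaluating the setting under the chosen equations; the mismatch with the observed bounce is what signals the model's low explainability.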
To summarize, this is a bird's-eye view of our explanation. Previous work has developed abstractions to formalize this notion of explanation. Don't think too much about this diagram's category-theoretic origins; treat it just as doodles and scribbles of what we have already discussed in our example. This is a very general way of framing explanations. How does this relate to neural networks?

Neural Networks as Explanations

In our previous example, our explanatory process consisted of mathematical equations derived from Newtonian physics. What if, instead of such equations, we had a neural network, an MLP that gave better predictions than our own physics modelling? This neural-network-based explanation correctly accounts for observed outcomes, so it has higher explainability than our previous physics model. Compared to the previous one, however, this explanation is less understandable,[5] even though it is more accurate. The physics equations were more legible and easier to relate to intuition than the opaque, large number of weights and activations in the neural network. All things considered, this is still a perfectly valid explanation.

In our examples so far, the phenomenon of interest (what we want to explain) is the physical world. For Interpretability in the context of AI Safety, our phenomenon of interest is the neural network itself. Our setting is the input vectors to the neural network. Our predictions and our observed outcomes are the same: the output vectors resulting from running the neural network. In this situation, there is no difference between predictions and observed outcomes, since both are canonically the direct output of our network.
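This setup already hints at why the non-identifiability introduced above arises: networks with very different weights can realize exactly the same input-output map, so the outputs alone cannot single out one set of internals. A minimal sketch with a tiny ReLU MLP (all numbers here are illustrative random weights, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))   # hidden layer: 2 inputs -> 4 ReLU units
W2 = rng.normal(size=(1, 4))   # output layer: 4 units -> 1 output

def forward(W1, W2, x):
    return W2 @ np.maximum(0.0, W1 @ x)  # ReLU MLP, no biases for simplicity

# Candidate B: permute the hidden units and rescale them.
# ReLU is positively homogeneous, so a positive rescaling of a unit's input
# weights can be exactly undone at the output layer.
perm = [2, 0, 3, 1]
scale = np.array([0.5, 2.0, 4.0, 0.25])
W1_b = W1[perm] * scale[:, None]   # scale each permuted unit's input weights
W2_b = W2[:, perm] / scale         # undo the scaling downstream

x = rng.normal(size=2)
assert not np.allclose(W1, W1_b)                                 # different weights...
assert np.allclose(forward(W1, W2, x), forward(W1_b, W2_b, x))   # ...identical behavior
```

Both weight settings satisfy any purely behavioral criterion equally well; nothing in the outputs distinguishes them.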
We can then simplify our abstraction for explanations: we select a set of inputs that map to a set of outputs through our model. This means that to treat a neural network as an explanation, it suffices to specify:

- An input domain (setting)
- A computational graph, the neural network model (process)

The observed outcomes are exactly the predictions, and they are entirely determined by the setting and process, since inference is deterministic[6]. For our purposes, an explanation is a model and a domain specification.

Okay, now what? We have an explanation that is not understandable to humans. How do we make sense of this? Well, we interpret it.

What is an Interpretation?

An interpretation is a map from a less understandable explanation to a more understandable explanation. We call the less understandable explanation our target. For us, this would be the neural network that gives us correct answers even though we do not understand how it reaches those answers. Our more understandable explanation is our proxy. We would like an explanation we can operate on; right now, our target explanation is essentially a black box. Looking back at our running example, if we want to engineer a way to make the ball even more bouncy, we have no intuition for how to do so. We would like to map our target neural network onto a mental model (a computational abstraction, a set of equations, an algorithm, …) that empowers us to control it. Investigating this mapping will be the bulk of the work in this sequence.

Example: Anthropic Circuit Tracing

Last year, Anthropic's research team published a methodology on their blog for interpreting their Large Language Models (LLMs). The replacement model is a proxy explanation for the production model. The attribution graph is a proxy explanation for the replacement model. We will not analyze it here (yet).
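The framing above can be sketched in code: an explanation is just a (process, domain) pair, and an interpretation proposes a more understandable proxy that accounts for the target's outcomes. All names here are illustrative, not from any library, and the "network" is a stand-in function we pretend is opaque.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Explanation:
    process: Callable[[float], float]  # the model, e.g. a network's forward pass
    domain: List[float]                # the set of inputs the explanation covers

    def outcomes(self) -> List[float]:
        # Predictions and observed outcomes coincide: both are the process's outputs.
        return [self.process(x) for x in self.domain]

# Target: an opaque "network" whose internals we do not understand.
target = Explanation(process=lambda x: 0.5 * (2 * x + 2), domain=[0.0, 1.0, 2.0])

# Proxy: a legible candidate produced by an interpretation, e.g. "it computes x + 1".
proxy = Explanation(process=lambda x: x + 1, domain=target.domain)

# A minimal faithfulness check: the proxy matches the target on the shared domain.
assert proxy.outcomes() == target.outcomes()  # [1.0, 2.0, 3.0]
```

The interpretation itself is the map from target to proxy; the check above is the weakest criterion a proxy must pass, and, as the next section argues, many distinct proxies can pass it.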
The key point worth noting is how this conception of interpretation as a kind of map aligns with and captures how we do mechanistic interpretability research.

Circuits as Interpretations

Usually, when we interpret a circuit in a neural network, we actually perform two interpretations:

- Circuit Interpretation: We attribute certain behavior of the full model to one of its (sub)circuits.
- Algorithmic Interpretation: We reformulate the isolated circuit as a human-understandable algorithm.

Recent work[1] has shown that both interpretations are problematic. In this sequence, we will first focus on the former (Circuit Interpretation). Now, we are finally ready to state the problem.

Non-Identifiability Problem

The non-identifiability problem arises when multiple competing interpretations fit our current criteria equally well, leaving us without a principled basis for selecting among them. Concretely, the problem is that we have multiple circuits to which we could attribute the full model's behavior. If we have multiple candidate proxy explanations: What can we introduce to help guide our choice of proxy? How do we relate candidates to each other? That's the problem! Join us in this sequence as we conduct experiments with toy models to elucidate our way forward.

^ Meloux, M., Portet, F., Maniu, S., & Peyrard, M. (2025). Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? Proceedings of the International Conference on Learning Representations (ICLR). arXiv:2502.20914.
^ Meloux, M., Dirupo, G., Portet, F., & Peyrard, M. (2025). The Dead Salmons of AI Interpretability. arXiv preprint arXiv:2512.18792.
^ This post draws heavily from: Rios-Sialer, I. (2025). Category-Theoretic Wanderings into Interpretability. Presented at EurIPS 2025, Queer in AI. Retrieved from https://www.unrulyabstractions.com/pdfs/wanderings.pdf. Refer to the paper for the complete bibliography.
^ Erasmus, A., Brunet, T. D. P., & Fisher, E. (2021). What is Interpretability?
Philosophy & Technology, 34(4), 833–862. https://doi.org/10.1007/s13347-020-00435-2
^ Broniatowski, D. A. (2021). Psychological Foundations of Explainability and Interpretability in Artificial Intelligence. NIST Interagency/Internal Report (NISTIR) 8367. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.8367
^ We are ignoring possible hardware failures and floating-point precision issues. We will return to these in a future post when we reason about substrates.
