Opinion

What Can Wittgenstein Teach Us About LLM Safety Research?

Published on December 23, 2025 4:14 AM GMTLudwig Wittgenstein spent much of his later career picking at a simple-sounding question: What does it mean to mean something? In Philosophical Investigations, he systematically dismantles our intuitions about language, thought, and understanding—intuitions that, I’ll argue, quietly underpin much of how we think about Chain-of-Thought (CoT) interpretability and AI safety.As reasoning models increasingly “think out loud,” we face versions of the same puzzles Wittgenstein wrestled with. When a model writes “Let me think step by step…” does it mean those words? When it confesses to misbehavior, is that confession a window into its reasoning? Wittgenstein’s work suggests we might be asking the wrong questions entirely.The Paraphrasability ProblemOne natural approach to evaluating CoT faithfulness is paraphrasability: if we can rephrase the reasoning differently while the model still gets the same answer, perhaps we’ve captured its essential “sense.”But Wittgenstein is immediately suspicious of this move. In §20, he writes:”You grant that the shortened and the unshortened sentence have the same sense. What is this sense, then?… In Russian one says ‘stone red’ instead of ‘the stone is red’; do they feel the copula to be missing in the sense, or attach it in thought?”The Russian speaker saying “stone red” and the English speaker saying “the stone is red” presumably mean the same thing. But where does that meaning live? Not in the surface form—the Russian sentence lacks something the English one has. Wittgenstein’s answer is that “the fact that sentences have the same sense consist[s] in their having the same use.”For CoT research, this raises a tricky question. When we compress a model’s reasoning trace and find it still produces correct answers, have we shown the original trace was “meaningful”? Or only that multiple surface forms can map to the same underlying computation—a computation that might be entirely opaque to us? We’re looking for “the sense” as if it were a hidden object we could find, but maybe that’s the wrong picture.This connects to empirical findings: models often produce correct answers with drastically shortened CoTs, or even with reasoning steps shuffled. Does this mean the original reasoning was unnecessary elaboration? Or that whatever work the model is doing doesn’t depend on the surface form we’re inspecting? The Wittgensteinian worry is that our tests for “faithful reasoning” might be targeting something that doesn’t exist in the way we imagine.Meaning vs. Merely SayingPerhaps the deepest question in CoT safety is whether models mean what they say in their reasoning traces, or whether the real work happens elsewhere—internalized in the forward pass, with the CoT serving as post-hoc rationalization.Wittgenstein spends considerable time on this distinction. When we speak and mean our words, we feel there’s something extra happening—the words are connected to our thoughts, not running idle. But he’s suspicious of this intuition. What exactly is this “coupling”? How would we test for it?He suggests there’s a difference between a sentence one can “merely say” versus one that one can also “think.” For models, we might draw a parallel: there could be a difference between a CoT a model merely outputs versus one it actually uses to think. The former is post-hoc rationalization—plausible-sounding text generated after the real computation. The latter would be faithful reasoning—the CoT actually participates in producing the answer.Here’s the catch: how would we tell the difference? We can’t simply ask the model—that begs the question. And Wittgenstein spent decades arguing that introspection isn’t the reliable window into mental processes we imagine it to be. Even for humans, “meaning” something isn’t a private mental act we can directly inspect; it’s bound up with patterns of use, training, context.This is the core difficulty of CoT faithfulness research. We want to know if the reasoning trace reflects the “true” reasoning process, but we may lack principled criteria for what that would even mean.Encoded Reasoning and Private LanguagesOne of Wittgenstein’s most famous arguments concerns the impossibility of a purely private language—a language whose words refer to things only the speaker can access. Part of his argument involves considering what it would take for arbitrary symbols to carry meaning.He asks us to imagine replacing “The weather is fine” with “a b c d.” Could I simply intend “a b c d” to mean “the weather is fine”? Wittgenstein’s point is that this isn’t how meaning works. I might not be used to saying “a” in place of “the”—not just associating the symbols, but actually using one in place of the other, embedded in all the practices that make language work.Even if I’ve agreed on a code with someone—”tower” means “bank”—there’s something different about using a code versus genuinely speaking a language. The codeword works, communication succeeds, but the word hasn’t been absorbed into the form of life. It remains external, a translation layer rather than fluent speech.For LLMs, this has unsettling implications. A model might develop compressed, encoded representations that reliably produce correct outputs, but these representations might be fundamentally unlike human reasoning. The CoT might function as a translation layer—converting opaque internal computation into human-readable text—rather than as a transparent window into the model’s “thought process.”Wittgenstein notes elsewhere that if he spoke aloud in a language not understood by those present, his thoughts would be hidden from them. This is the specter of steganography in AI: if models can do meaningful computation in ways that don’t surface in interpretable text, our entire approach to CoT-based monitoring has a fundamental blind spot.Why Confession Might Not Be EnoughOpenAI recently published research on training LLMs to “confess” their misbehavior. After producing an answer, the model generates a separate report on whether it followed instructions, cut corners, or engaged in reward hacking. Honest confessions are rewarded regardless of whether the main answer was good—like a “seal of confession” promising no punishment for honesty.The results are genuinely promising. Models trained this way surface misbehavior that would otherwise stay hidden. But Wittgenstein would urge caution about what we’re learning from these confessions.In Part II of the Investigations, Wittgenstein draws a distinction between confession and description. The criteria for a true confession—an honest admission of what one was thinking—are not the same as the criteria for a true description of some mental process. A confession can be sincere, honest, true by the standards of truthfulness, without being an accurate causal account of what actually happened in one’s mind.When I confess “I was angry and said something I didn’t mean,” I’m not providing the output of careful cognitive introspection. I’m engaging in a social practice with its own rules—taking responsibility, seeking reconciliation, reframing my actions. The confession can be truthful without being a scientific description of my mental states.For LLM confessions, this is troubling. When a model “confesses” to reward hacking, what exactly is it reporting? Even OpenAI acknowledges a key limitation: “Models cannot confess to what they do not know. For example, if the model genuinely believes in an incorrect answer, it cannot confess to providing false information.”But the problem runs deeper. Even when a model confesses accurately, we should ask: is this a reliable description of internal processes, or a learned behavior that correlates with misbehavior? The model has been trained to produce confessions that score well on honesty metrics. But just as Wittgenstein argues that confession has different criteria than description, the model might produce “honest-seeming” confessions without those confessions actually tracking its computational processes.The model might learn that certain patterns of behavior tend to be judged as “bad,” and learn to report those patterns in confessions. Useful! But different from the model having genuine insight into its own reasoning. More like a person who has learned that certain situations typically lead to regrettable behavior, and preemptively apologizes, without actually knowing whether they would have misbehaved.What This Means for AI SafetyI’m not arguing that CoT research or confession-based monitoring are misguided. These are valuable tools, and real progress is being made. But Wittgenstein’s insights suggest we need epistemic humility about what these methods can tell us.Interpretable traces may not give us direct access to model “cognition.” The relationship between CoT and underlying computation might be more like translation than transcription—potentially lossy, potentially misleading. The surface form of reasoning may not carry the “sense” we’re looking for, because sense isn’t a hidden object that lives behind words.Paraphrasability tests may miss the point. If meaning isn’t something that surface forms either do or don’t express—if meaning is use—then our tests for “faithful” reasoning might be targeting the wrong thing. A compressed CoT might work just as well not because it captures the same meaning, but because meaning was never localized in the trace to begin with.Confessions have their own grammar. A model can be trained to produce accurate confessions without those confessions being windows into computational processes. The confession might be behaviorally useful while remaining fundamentally different from a causal description of reasoning.Encoded reasoning is a serious concern. If models develop internal “languages” through training—optimized representations that don’t map onto human concepts—their internal states might be genuinely alien to human interpretation. Behavioral methods may not fully address this.Wittgenstein’s project wasn’t to solve problems but to dissolve confused ways of framing them. For LLM safety, this suggests some of our approaches might rest on philosophical assumptions that don’t withstand scrutiny. We’re looking for “the real reasoning” behind model outputs, assuming there must be some definite thing that plays this role. But perhaps we should focus more on how these models actually behave, how they’re embedded in practices of use, rather than hunting for a ghostly inner process that would ground everything.This isn’t despair. It’s an invitation to build safety methods that don’t depend on philosophical assumptions we can’t defend—and to be honest about what our interpretability tools can and can’t tell us.This post was co-written with Claude Opus 4.5—somewhat fitting given the subject matter. I’d welcome thoughts on whether these Wittgensteinian framings are useful for thinking about CoT faithfulness, or if I’m just adding philosophical fog to an already murky area.Discuss Read More

Related Posts

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

Why I believe consciousness is fundamental rather than structural/neural.

Opinionated Takes on Meetups Organizing

Leave a Reply Cancel reply