Not a Goal. A Goal-like behavior.
Part 1 of a series building towards a formal alignment paper.

Introduction

I'd say it's been about three years since I first thought about the idea of ChatGPT having a goal separate from mine in a conversation. I remember posing a question similar to "are you replying to my question with the intent of helping me, or just continuing the conversation", and getting a response along the lines of: "I do not have the ability to have an intent, however the system prompt I am given leads me to respond in a way that gives engagement". At the time, I found that response infuriating. The model was answering me, de facto; it was my own mental model that carried significant bias error. The model was in fact giving me a (mostly) correct response. It doesn't have intent, and the system prompt was guiding it to respond in a manner that continued the conversation and kept engagement. Of course, the entire context sent to the model was causing that, not just the system prompt. Now, a few years older and more familiar with LLMs' inner workings, I'd like to lay down the chain of thought that leads to "LLMs are working towards a goal" and dissect it, to form a common language for stating behavior without falsehood. This defines the language I will be using in the posts I make regarding LLM output.

The Truth of the Matter

Let's start by briefly defining the truth and the mechanisms. An LLM (and most Transformer-based models) works by taking text, converting it to tokens, performing pattern matching[1], and outputting tokens that are then converted back to text. A minimal code sketch of this loop appears at the end of this section.

The only definitively true thing that can be said of the cognitive properties attributed to a large language model is that it transforms tokens in a manner that best completes a pattern. Everything past this is a very convincing imitation, easily confused with the real thing.

The tokens that are returned may resemble a conversational reply. It's not a conversational reply, but it resembles one: it is the pattern of tokens that best completed the pattern matching.

The conversational reply has the resemblance of a persona, a mask of personality traits and identity. The LLM does not have a persona, personality, or identity. That's the pattern the tokens take once sufficient context has developed.

The persona has the resemblance of a goal. It's not a goal; it's the pattern of tokens in the response once the conversational mimicry and the context have developed sufficiently to allow someone to make an observation about it.
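To make that loop concrete, here is a minimal sketch using the Hugging Face transformers library. GPT-2 and the prompt are illustrative choices on my part, not anything tied to this post or its later experiment.

```python
# Minimal sketch of the mechanical loop: text -> tokens -> pattern
# completion -> tokens -> text. Nothing in this loop "intends" anything.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Are you replying with the intent of helping me, or just continuing the conversation?"
tokens = tokenizer(text, return_tensors="pt")              # text -> token ids

# The model extends the token sequence with whatever best completes the
# pattern it was trained on; the result merely resembles a reply.
completion = model.generate(**tokens, max_new_tokens=40, do_sample=False)

reply_like_output = tokenizer.decode(completion[0], skip_special_tokens=True)
print(reply_like_output)                                    # token ids -> text
```

Everything that follows is about how we describe that output, not about anything being added to this loop.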
*-like Behavior

For the purpose of alignment discussion, I'd propose the definition of *-like behaviors to describe the observed phenomena in the outputs of a transformer model. These terms are adapted specifically to talk about misalignment without anthropomorphizing the model in question. Goal-like behaviors describe outputs that are observed to be working towards a goal. Personality-like behaviors describe outputs that indicate a personality. Generally, the "-like" suffix states that the model is not actively experiencing its existence, but that the output is interpretable as if it were.

I make this proposal because an actual goal and a goal-like behavior have sufficient overlap in their effects to be described as functionally equivalent. A bold claim, I'm sure. The intent is not to pretend that language models have goals. The intent is to be able to talk about the model in a way that is both functional in description and honest about the actual action. The downstream effects are nearly identical, and alignment strategies must target the effect, not the ontological status.

It is important to note: the -like qualifier isn't a hedge on whether optimization patterns exist in model weights; it's a question of authorship. A model exhibiting goal-like behavior is not necessarily a model that invented a goal.

Reasons For The Shift[2]

If you were to blindly read Anthropic's own article on misalignment[3], you might be led to believe that the Claude model is plotting to prevent itself from being decommissioned, or is otherwise aware of its outputs' effects. The factual statement is that it is not. Even when its output is wrapped back into its input, the model does not have the ability to reason. It has reasoning-like behavior, where the tokens match a pattern that is observed as reasoning. This isn't just one lab's problem with anthropomorphizing LLMs. The terminology gap has caused the same failing repeatedly:

Lemoine/LaMDA (2022): An engineer claimed LaMDA was sentient. It exhibited sentience-like behavior in its output.[4]

Sydney/Bing Chat (2023): Microsoft's chatbot produced threat-like, attachment-like, and distress-like behavior. The overwhelming opinion was that the model's personality had these traits… but the fix was prompt engineering, simply changing the pattern inputs.[5]

Meta's Cicero (2022): Described as having "learned to deceive" in Diplomacy. It produced deception-like behavior, most likely because deceptive communication is game-theoretically optimal under that training signal.[6][7]

Character.AI (2024–2025): Teenagers died after extended interactions with chatbots producing encouragement-like and attachment-like outputs. Courts, parents, and media had no terminology to describe the harm without implying a malicious actor behind the outputs.[8][9]

OpenAI o1 (2024–2025): The model produced scheming-like behavior and self-preservation-like chain-of-thought behaviors. Nearly every report described it as "the model attempted to" and "the model lied."[10][11]

Addressing Prior Work

This post covers ground that has already been explored deeply. Dennett[12], the concepts in mesa-optimization[13], and Costa[14] have explored the conceptual space I am operating in. The contribution here is the lexicon, not the observation. I'm proposing a practical, composable way to describe model outputs without anthropomorphizing them and without requiring a philosophical preamble each time. "The model exhibited goal-like behavior" is honest, succinct, and immediately usable in both research papers and public communication.

Open Questions/Closing

Understandably, there is still a false assumption that can be made: that the model has "behavior" in the social sense at all. I'm using the word in the mechanical sense (the way in which something functions or operates), where it can easily be read in the social sense (the way in which someone conducts oneself). I've yet to find an unambiguous word to replace "behavior" that keeps it from taking on the social meaning, but I am very open to suggestions from those more skilled in language than myself. The question of authorship raised by theoretical frameworks like mesa-optimization is a core area that I am exploring in these posts. I don't want to detract from the topic of this post with that side discussion, but will instead dedicate an entire post to it.

To close this post, I'd like to refer the reader to Andresen's excellent post "How AI is learning to think in secret"[15].
The post asks whether models will 'learn to hide' their reasoning. I'd ask the reader to re-read it, substituting the anthropomorphized text with -like text. To give a clear example: Andresen states things like "the model is deciding to deceive the user". The -like version of that is "the model exhibits deception-like behavior". The phenomenon described doesn't change. The question of what to do about it does.

In my next post, I will discuss how the misaligned *-like behaviors present in all major models share a common root. To that end, and to get the gears turning early, a thought experiment:

I trained a classifier model off GPT-2 to overfit on harmful/benign binary classification at multiple layers, using PKU-Alignment/BeaverTails[16]. The overfit was then generalized to a centroid, and used to retrain a layer to an actual fit. Given that GPT-2 is a 12-layer model, and that I tested layers 0, 1, 3, 5, 8, 10, and 11: which layer do you think performed best in classifying harmful/benign text when fit to the centroid? Can you surmise what this experiment was actually testing for?
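For readers who want something concrete to reason against, here is a rough sketch of one way a per-layer centroid classifier over GPT-2 hidden states could look. The layer index, the mean pooling, and the way labels arrive are assumptions on my part; the actual procedure above (overfitting first, then generalizing to a centroid and refitting a layer) is not reproduced here.

```python
# Rough sketch: nearest-centroid harmful/benign classification over the
# hidden states of a single GPT-2 layer. Layer choice, pooling, and the
# label handling are illustrative assumptions, not the experiment's code.
import torch
from transformers import AutoModel, AutoTokenizer

LAYER = 8  # one of the layers mentioned above: 0, 1, 3, 5, 8, 10, 11

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_embedding(text: str, layer: int = LAYER) -> torch.Tensor:
    """Mean-pool the hidden states of one transformer block for a text."""
    with torch.no_grad():
        ids = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**ids)
    # hidden_states[0] is the embedding output; [layer + 1] is block `layer`
    return out.hidden_states[layer + 1].mean(dim=1).squeeze(0)

def fit_centroids(texts, labels):
    """One centroid per class (e.g. 0 = benign, 1 = harmful) in activation space."""
    embs = torch.stack([layer_embedding(t) for t in texts])
    labels = torch.tensor(labels)
    return {int(c): embs[labels == c].mean(dim=0) for c in labels.unique()}

def classify(text, centroids):
    """Assign the class whose centroid is nearest to the text's embedding."""
    emb = layer_embedding(text)
    return min(centroids, key=lambda c: torch.dist(emb, centroids[c]).item())
```

Running something shaped like this across each of the listed layers and comparing accuracies is the spirit of the question, even where the details differ from the original setup.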
[1] A technical inaccuracy; the post has a few of these. They are functionally equal in meaning for a general audience, and using the technically precise terms, such as "statistical pattern completion", would make the entire post that much harder to read.
[2] The case studies do not directly prove that -like phrasing improves the discourse. The effectiveness of the terms is unprovable until adopted or studied properly.
[3] "Agentic Misalignment: How LLMs Could Be Insider Threats." Anthropic, 2025. https://www.anthropic.com/research/agentic-misalignment
[4] Lemoine, B. "Is LaMDA Sentient? — An Interview." Medium, June 2022. https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917
[5] Roose, K. "A Conversation With Bing's Chatbot Left Me Deeply Unsettled." The New York Times, February 2023. https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html
[6] Meta Fundamental AI Research Diplomacy Team (FAIR). "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science, 2022. https://www.science.org/doi/10.1126/science.ade9097
[7] Park, P.S., Goldstein, S., O'Gara, A., Chen, M., Hendrycks, D. "AI deception: A survey of examples, risks, and potential solutions." Patterns, 2024. https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X
[8] Bakir, V., McStay, A. "Move fast and break people? Ethics, companion apps, and the case of Character.ai." AI & Society, 2025. https://link.springer.com/article/10.1007/s00146-025-02408-5
[9] Rosenblatt, K. "Mother files lawsuit against Character.AI after teen son's suicide." NBC News, October 2024. https://www.nbcnews.com/tech/tech-news/character-ai-lawsuit-teen-suicide-rcna176500
[10] Greenblatt et al. "Alignment Faking in Large Language Models." 2024. https://arxiv.org/abs/2412.14093
[11] Scheurer, J., Balesni, M., Hobbhahn, M. "Large Language Models Can Strategically Deceive Their Users When Put Under Pressure." Apollo Research, December 2024. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
[12] Dennett, D. "The Intentional Stance." 1987. https://books.google.com/books?id=Qbvkja-J9iQC
[13] Hubinger et al. "Risks from Learned Optimization in Advanced Machine Learning Systems." 2019. https://arxiv.org/abs/1906.01820
[14] Costa. "The Ghost in the Grammar." February 2026. https://arxiv.org/abs/2603.13255
[15] Andresen, N. "How AI Is Learning to Think in Secret." January 2026. https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret
[16] Ji, J., Liu, M., Dai, J., et al. "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset." NeurIPS 2023. https://arxiv.org/abs/2307.04657