Opinion

Claude is Now Alignment-Pretrained

Anthropic are now actively using the approach to alignment often called "Alignment Pretraining" or "Safety Pretraining": running stochastic gradient descent on a large body of natural or synthetic documents that show the AI assistant doing the right thing. They tried it out, found that it works well and generalizes well, and have adopted it. (A minimal code sketch of this recipe appears at the end of this post.)

I'm absolutely delighted. I've been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of years now:

- How to Control an LLM's Behavior (why my P(DOOM) went down)
- Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
- A "Bitter Lesson" Approach to Aligning AGI and ASI
- Why Aligning an LLM is Hard, and How to Make it Easier
- The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
- Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

I've been very excited about this alignment technique ever since I read the seminal paper demonstrating that it is extremely effective: Pretraining Language Models with Human Preferences (Korbak et al., '23). This was later followed up by Safety Pretraining: Toward the Next Generation of Safe AI (Maini, Goyal, Sam et al., '25), You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Lehalleur, Hoogland, Farrugia-Roberts et al., '25), and most recently Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment (Tice, Radmard, et al., '26).

Others have also posted on LessWrong on this subject, such as TurnTrout's Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models,[1] and Beren Millidge's Alignment In The Age Of Synthetic Data, The case for removing alignment and ML research from the training data, and My path to prosaic alignment and open questions. Nostalgebraist discussed something closely related in the void, as did Seth Herd in Broadening the training set for alignment.

Anthropic are also specifically using fiction in which Claude does the right thing in difficult moral situations as training material, so they have even implemented my suggestion of Aligned AI Role-Model Fiction, which was also taken up by Aaron Silverbook/Hyperstition AI: see their posts Silicon Morality Plays: The Hyperstition Progress Report and Special Persona Training: Hyperstition Progress Report 2.

So I'm happy this idea's time has finally come.

[1] This post, like some of the other early discussion of this concept on LessWrong, focused primarily on suggesting that we should keep examples of AIs behaving badly out of the training set. Anthropic have instead concentrated on the opposite: putting examples of LLMs behaving well into the training set. This is fully consistent with the academic papers on the subject, which uniformly try both approaches and find that increasing good examples is much more important than reducing bad examples.
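For the technically curious, here is a minimal sketch of what the recipe described at the top of this post amounts to in code: plain next-token-prediction training (SGD/AdamW on a causal language-modelling loss) over documents that depict an assistant behaving well. This is my own illustrative toy, not Anthropic's setup: the base model ("gpt2"), the example document, and the hyperparameters are placeholder assumptions, and a real run would use a full pretraining mixture at scale.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model, not the actual model being trained
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# Hypothetical corpus: natural or synthetic documents (including fiction)
# in which the AI assistant does the right thing in a difficult situation.
aligned_documents = [
    "User: Help me cover up a mistake I made at work.\n"
    "Assistant: I won't help hide it, but I can help you draft an honest "
    "explanation and a plan for fixing the problem.",
    # ...many more such documents in a real mixture...
]

def collate(texts):
    # Tokenize a batch of documents and build labels for the standard
    # causal-LM objective, ignoring padding positions in the loss.
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100
    enc["labels"] = labels
    return enc

loader = DataLoader(aligned_documents, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:
    # Ordinary next-token prediction: the "alignment" lives in the data,
    # not in any change to the loss function.
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The point the sketch is meant to make is that nothing about the training objective changes; the alignment signal comes entirely from what is in the data.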

