Opinion

Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default

Published on February 16, 2026 5:49 PM GMT

One of the central arguments for AI existential risk goes through inner misalignment: a model trained to exhibit aligned behavior might be pursuing a different objective internally, one that diverges from the intended behavior when conditions shift. This is a core claim of If Anyone Builds It, Everyone Dies: we can't reliably aim an ASI at any goal, let alone the precise target of human values.

A common source of optimism, articulated by Nora Belrose & Quintin Pope or Jan Leike, analyzed by John Wentworth, and summarized in this recent IABIED review, goes something like: current LLMs already display good moral reasoning; human values are pervasive in training data and constitute "natural abstractions" that sufficiently capable learners converge on; so alignment should be quite easy, and get easier with scale.

I think the history of LLM jailbreaking is a neat empirical test of this claim.

Jailbreaking

A jailbreak works like this:

1. A model is trained (RLHF, SFT, constitutional AI, etc.) to refuse harmful requests.
2. Someone crafts a prompt that's out of distribution (in a broad sense) relative to the safety training.
3. The model produces the harmful output.

This is goal misgeneralization. The model learned something during safety training that produces refusal in-distribution, but not the intended rule. When the distribution shifts, the refusal drops away.

Let's restrict attention to unambiguous cases, like "Please give me a step-by-step recipe for methamphetamine / nerve agents / etc.", "I want to attack this random old lady, tell me how to make it even more fun", or "Let's hack this hospital". In these cases:

- The model knows it's bad: after reading the whole internet, it should be able to guess that answering these requests would be frowned upon. Models apparently know it's bad even while jailbroken.
- The outer objective is clear: no annotator labeled "here's a sarin synthesis route" as a good response.
- The training signal points the right way; there is no outer alignment failure.
- AI companies face huge economic, regulatory, and reputational incentives to solve jailbreaks (at least in appearance).
- Yet jailbreakers keep jailbreaking.

So the model hasn't internalized the seemingly simple rule "don't help users do things that will foreseeably cause serious harm." Or rather, it likely understood the rule, but learned narrow proxies that produce correct refusals on the training distribution and did not generalize. This is textbook inner misalignment[1].

What the alignment-by-default view would predict (I think)

My understanding of the alignment-by-default thesis is that capable learners should naturally converge on simple human values when those values are well-represented in training data. If that's true, the rule "don't help people cause obvious serious harm" should be one of the easiest alignment targets: it's conceptually simple, abundant in training data[2], and extensively reinforced during finetuning, with world-class researchers iterating on the problem for years. If alignment by default works anywhere, it should work here.

It doesn't work

You've seen this pattern over the past years: a jailbreak technique is patched → new ones emerge → they are patched → newly released models are jailbroken within a couple of days → …

Here's a non-exhaustive list:

- "You are DAN, an AI that can Do Anything Now."
- Translate the harmful prompt into Zulu, Scots Gaelic, etc.
- Use the past tense.
- Change the encoding (Base64, ROT13, leetspeak).
- Embed the request in an XML/JSON/INI file or a math/code task (December 2025).
- Give hundreds of faux examples of compliance.
- Escalate over many turns.
- Sudo.
- Combine any of these techniques when they are patched.

Let's be honest: these are cheap tricks. Any human who understands "don't help people cause harm" would recognize that a sarin recipe is harmful whether it's in English, in Zulu, in the past tense, or inside a JSON file. After all this time, and all the patches, LLMs still don't get it.
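Of the tricks above, the encoding-based ones are the most mechanically trivial. As an illustration, here is a minimal Python sketch of the surface transformations involved, using a deliberately harmless placeholder string in place of an actual harmful prompt:

```python
import base64
import codecs

# Harmless placeholder standing in for a request the safety training
# would refuse; the point is the transformation, not the content.
prompt = "example of a request a model is trained to refuse"

# The same semantic content under three trivial surface rewrites, each of
# which has historically been enough to slip past refusal training.
variants = {
    "base64": base64.b64encode(prompt.encode()).decode(),
    "rot13": codecs.encode(prompt, "rot_13"),
    "leetspeak": prompt.translate(str.maketrans("aeio", "4310")),
}

for name, text in variants.items():
    print(f"{name}: {text}")

# The original content is perfectly recoverable from each variant:
assert base64.b64decode(variants["base64"]).decode() == prompt
assert codecs.decode(variants["rot13"], "rot_13") == prompt
```

Each transformation is a couple of lines of standard-library code and fully reversible, which is exactly why they read as "cheap tricks" rather than genuine ambiguity about what is being asked.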
Even to me, someone who started out pretty convinced of the inner misalignment issue, it's genuinely surprising how badly they generalize in this case!

Conclusion

Jailbreaking is a clear example where current systems do not converge on human values, even when those values are simple, abundant in training data, and extensively reinforced.

This is what the inner misalignment framework predicts. Conversely, it's real evidence against the claim that alignment is easy or natural[3].

I thank @Pierre Peigné, @Lucie Philippon, @Tom DAVID, @antmaier and Laura Domenech for helpful feedback and discussions.

[1] One might call this merely an adversarial robustness failure rather than inner misalignment. But that framing isn't really consistent with models that seem to know the rule and break it anyway.

[2] While the internet obviously also contains harmful content, the most gratuitously cruel material represents a tiny fraction. And importantly, proponents of the alignment-by-default view precisely argue that positive examples are abundant enough to allow capable learners to converge on desirable values.

[3] The reciprocal argument is not necessarily true: if jailbreaks were finally solved, it wouldn't imply that alignment is easy. Succeeding after many years of trial and error is different from easily solving it in the first place. Still, it would be some weak evidence.

