
Was Anthropic that strategically incompetent?

This is pretty vague. It might be vague because Dario doesn’t have a plan there, or it might be that the asks are too overton-shattering to say right now and he is waiting till there’s clearer evidence to throw at people.

I think it’s plausible Dario’s take is like “look, it’s not that useful to have a plan in advance here because it’ll depend a ton on what the gameboard looks like at the time.”

(Raemon on Anthropic’s plans in March 2025)

A year ago, @Raemon did a writeup of reasons why Anthropic-like strategies of racing, then slowing down when AGI becomes imminent, would not work.

Anthropic’s worldview as described by Raemon

- Superalignment is probably not that hard[1] to navigate.
- Misuse by totalitarian regimes or rogue actors is reasonably likely by default, and very bad.
- AGI founded in the West would not be bad in the ways that totalitarian regimes would be.
- Quick empiricism is a much better approach to solving research problems than trying to “think ahead” too cleverly.
- It’s better to only make strong claims when you have strong evidence to back them up, and only make strong commitments when you’re confident those commitments are the right ones. (This is partly about “well, there are some political games you just do need to play”, but at least partly motivated by epistemic principles.)

Anthropic’s strategy as described by Raemon

- Stay on the frontier of AI, so that they reliably have “a seat at the table” during the realpolitik discussions of how AI policy plays out.
- Race to get ~human-level AGI that helps with scalable superalignment research.
- Iterate on various empirical frameworks for scalable oversight, e.g. various ways you can use weaker models to check that larger untrusted models are safe. Use a mix of interpretability and control techniques to help.
- Encourage the West to race, banking on it being easier to get companies / politicians to pivot to safer practices when there are clearer smoking guns and clearer asks to make of people.
- Eventually they do need to spend a bunch of political chips, which they seem to be starting to spend this year. But those chips are focused on something like: “race to ensure a lead over China, then maybe briefly slow down when the lead is more secure and we’re close to the dangerous ASL levels, then invent superintelligence, and use it to solve a bunch of problems” (as opposed to “global moratorium with international buy-in”).

However, I don’t think that Anthropic-like strategies are the best only in worlds where alignment is easy. In the end, mankind will either create the ASI, globally ban it until A Lot Of Alignment Research Is Done In A Way Which Is Robust To Mankind’s Incompetence And Backed By Much Evidence, or destroy itself for not-so-related reasons (e.g. in a nuclear war).[2] Additionally, any attempt to enact a global ban would by definition affect Chinese companies and would require the CCP to also sign a ban-enforcing treaty. I doubt that getting the CCP to sign such a treaty is any easier than making the USG destroy xAI for having Grok become MechaHitler.

MIRI cluster’s efforts

Suppose that alignment is indeed as hard as the MIRI cluster believes. Then mankind’s survival would require a global ban on AI-related research. In turn, a global ban would require those who are convinced of its necessity to persuade politicians, or to gain leverage over politicians who can enact it (e.g. by having some American politicians become x-risk-pilled after the IABIED march, then having them propose a mutual ban to China, the EU, etc.).
The most prominent effort of the MIRI cluster to stop the ASI race is the IABIED book and the march. However, the march was either poorly organised (e.g. I suspect that a 10K march in New York could have been arranged by now, but I doubt that it would have the same effect) or is closer to a demonstration of weakness, since the number of people declared necessary for the march is two orders of magnitude larger than the number of people who signed the pledge.

Since the march requires every seventh person in Washington DC to join it, it could also be useful to consider a layman’s perspective. The layman’s beliefs would somehow have to reach the point where the layman considers a global ban the best course of action. Evidence capable of changing the layman’s beliefs can be empirical, non-empirical, or the kind that simply awakens the layman to AGI-related threats and lets them do further research on x-risk. Unfortunately, I don’t see a way to deliver theoretical evidence to the layman beyond what Yudkowsky has already tried[3] to do. Empirical evidence of AIs being evaluation-aware or even doing clearly misaligned things also mostly comes either from Anthropic and OpenAI or from visible failures like GPT-4o driving people into psychosis. Therefore, I don’t believe that there is a better strategy to prevent a misaligned ASI than a hybrid of awakening as many people as possible, providing as much evidence as possible, and preparing to enact a ban once it becomes likely to succeed (e.g. once it is sufficiently popular in Congress, or once a ban is proposed in the UN by a representative).

Anthropic’s role in the AI race

Anthropic’s initial plan was to cause a race to the top on safety, which would either succeed in alignment or generate evidence of our inability to verify it. The plan proved itself a failure, which raises the question of what a counterfactual Anthropic should have done to convince the other companies to stop the race to the most capable models. When xAI was founded on March 9, 2023, Anthropic hadn’t even released the first Claude model. The commitment not to advance the frontier of capabilities also wasn’t ditched until Claude 3.5 Sonnet.

As Zvi described it in June 2024: “The better question is to what extent Anthropic’s actions make the race more on than it would have been anyway (italics mine — S.K.), given the need to race Google and company. One Anthropic employee doubts this. Whereas Roon famously said Anthropic is controlled opposition that exists to strike fear in the hearts of members of OpenAI’s technical staff. I do not find the answer of ‘none at all’ plausible. I do find the answer ‘not all that much’ reasonably plausible, and increasingly plausible as there are more players. If OpenAI and company are already going as fast as they can, that’s that.”

On July 8, 2025, mankind even saw xAI’s model call itself MechaHitler, and xAI itself release the most capable model at the time just a day later, meaning that xAI punts on safety and that doing anything about the race would likely require enough evidence of its dangers to (have politicians) stop such bad actors. Additionally, I find it hard to believe that Anthropic could have influenced the Chinese race towards AGI in any way except by letting China distill from Anthropic’s models, which Anthropic tried to prevent upon learning of it.

Anthropic’s role in evidence generation

The next topic is the role which Anthropic played in creating evidence for or against the idea that alignment is hard.
Anthropic’s descriptions of interpretability tools and its published alignment assessment methods and results are probably the only things that could have contributed to safety-washing. Evhub’s case for alignment being hard implies that Anthropic has yet to encounter evidence of hard-to-solve problems like long-horizon RL causing the models to drift into misalignment. Suppose that Anthropic tried as hard as it could to find such evidence and didn’t succeed because the right evidence didn’t exist in the first place. Then any empirical demo of hard-to-detect misalignment which could convince the world to stop would have to come from a long-horizon RL experiment. Whenever someone conducts a long-horizon RL experiment, it may end with an unaligned AI taking over, with the creation of an aligned AI, or with the generation of evidence which makes a global ban easier or reduces the chance that a future experiment causes a takeover. Therefore, mankind should squeeze the most out of each such experiment. What strategy could one propose other than the Anthropic-like one of careful monitoring, testing ideas on CoT-based AIs where misalignment is more legible, and praying for empirical evidence to finally emerge? Anthropic applying mechinterp to GPT-N or Gemini-3-sociopath?

[1] Anthropic people self-report as thinking “alignment may be hard”, but Raemon is comparing this to the MIRI cluster, who are like “it is so hard your plan is fundamentally bad, please stop.”

[2] Unfortunately, the latter option doesn’t actually imply anything except for a potential need to race towards the ASI, in a manner similar to Bostrom’s idea that egoists should advocate for accelerating the ASI.

[3] Yudkowsky’s attempt to bite the bullet also earned a share of negative reviews pointing out that the thesis was argued in an overly weird manner, as is typical for Yudkowsky.

