Opinion

Contra Alexander’s Half-Defence of Bio Anchors

Published on February 15, 2026 1:08 PM GMT

Scott Alexander has a new piece out on Bio Anchors. For those who weren't keeping up at the time, it went like this:

- Ajeya Cotra produced a long, deeply sourced report in 2020, which estimated AGI in ~2045, based on anchoring the development of AI to five different biological processes. Each of these anchors gave an estimate of the amount of compute required to produce a human-level intelligence. She then estimated how fast compute would grow.
- Eliezer Yudkowsky responded with an enormous fictional dialogue where he debates various characters on AI timelines before finally getting to Ajeya Cotra's report, and estimated that AGI will arrive much sooner than 2045.[1]
- Scott Alexander reviewed this debate.
- Eliezer turned out to be basically right on the timelines thing.[2]
- Scott Alexander did a retrospective.

EDIT: Additional Discussion

There is also some discussion deep in a comment thread where Eliezer was discussing bio anchors, but the overall comment threads on that post are pretty messy. Daniel Kokotajlo has praised the model, and (allegedly) claimed that AI 2027 was based on it. His primary improvement was to throw out the Bio Anchors part and use his own estimates of the compute required for AGI.
I'm not sure how to think of Ajeya's contribution overall, since estimating the compute required for AGI is the hard part of this, and it is the part Ajeya got seriously wrong.

Yudkowsky's Point On Methodology Stands

Let me first try to put words in both sides' mouths.

Cotra's point seems to be: we don't know what's going on, so let's put together some prima facie good estimates of what might be going on, and then average them, like you would when doing a Fermi estimate.

To which Yudkowsky responded: no, you don't know what's going on, so your estimates of what might be going on will be totally unconnected to reality, and you can't get good numbers from things you don't understand.

I'd argue that the crux here is which methods for generating arguments will produce arguments that bind to reality. Cotra believes her methods will work even when she's uncertain about the world; Yudkowsky believes they won't.

Fermi Estimation

The Fermi-estimate approach is worth looking at in more detail. Fermi estimation is the process by which you can usually get a decent (i.e. within 1 OOM) estimate of some quantity by just eyeballing/vibing out the intermediate variables. You can watch me fail to do one here:

Me Failing To Fermi Estimate Car Dealerships In The US

A classic example is estimating the number of car dealerships in the US.
You guess how often you buy a car (every five years or so), how long that process takes you (two hours), how many salesmen a dealership employs (ten), how many hours a year they spend on sales (two thousand), and how many people are in the US (three hundred million).

Then you multiply and divide those numbers: 1/5 cars per year per person × 2 hours per car ÷ 2,000 hours per employee per year ÷ 10 employees per dealership × 300,000,000 people = 6,000 car dealerships.

Real answer: 60,000.

Claude guesses that car dealerships only have 3-5 salespersons, that people spend more time with them than I thought (4-6 hours total, though the salesperson may only be present for part of that), and that salesperson utilization is below 100%.

I was an OOM off. I might have made a better guess if I had ever bought a car or lived in the US!

We can think of Fermi estimates like this as a random walk in log-space. My equation had five elements in it (though one, the US population, could easily be looked up). If each one is off by around a factor of 2, then that's a step of log₁₀(2) ≈ 0.3 OOMs. With five such steps, we would expect the overall error to be √5 × 0.3 ≈ 0.67 OOMs, or a factor of about 4.7. This is pretty impressive for a guess which doesn't require you to leave the house!

Report-Based Epistemology

The Bio Anchors timeline report reminds me of another personal bugbear, the Rethink Priorities Welfare Range report. Both have a vast amount of effort put into them, contain a lot of sources, and make a huge number of arguments. Both are clearly written by thoughtful people working in good faith. I also think that both of their conclusions are basically nonsense.

Am I arguing against analysis in all cases? No, not at all! I think this kind of analysis is great for deciding e.g. which charity is better.
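The dealership arithmetic and the random-walk error model above can be sketched in a few lines (all inputs are the guesses from the text):

```python
import math

# The guessed intermediate quantities from the text.
cars_per_person_per_year = 1 / 5       # buy a car every five years
hours_per_sale = 2                     # time spent per purchase
hours_per_salesperson_per_year = 2000
salespeople_per_dealership = 10
us_population = 300e6

# Multiply and divide to get the estimate.
sales_hours_needed = us_population * cars_per_person_per_year * hours_per_sale
salespeople_needed = sales_hours_needed / hours_per_salesperson_per_year
dealerships = salespeople_needed / salespeople_per_dealership
print(dealerships)  # 6000.0 — vs the real answer of ~60,000: one OOM off

# Random-walk error model: five factors, each off by ~2x, i.e. a step of
# log10(2) ≈ 0.3 OOMs; independent errors grow like sqrt(number of steps).
step = math.log10(2)
expected_error_ooms = math.sqrt(5) * step
print(round(expected_error_ooms, 2))  # ≈ 0.67 OOMs, a factor of ~4.7
```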
You can solidly estimate the effectiveness of malaria nets, deworming, lead reduction, and direct giving using standard estimation techniques, because none of the sub-questions involved require difficult out-of-distribution thinking.

If we look back at the analysis I did earlier, estimating a ~0.67 OOM error for a five-step Fermi estimate, we can see that it only works when each of your intermediate values is off by around a factor of two! If one of your intermediates is off by a factor of infinity, then your final estimate is also off by a factor of infinity.

Now, to be fair, Ajeya's report, like most EA-style reports, uses a two-fold estimation strategy: it comes up with five distinct anchors, estimates each separately, and then spreads its uncertainty across the anchors. This strategy seems good at first glance, but I think it actually hides problems rather than solving them.

Averaging over different estimates can mean multiple things. It could be that you expect your different methods to give the same result, so by averaging you're compensating for noise in the estimates (like estimating malaria net effectiveness from population-level studies vs multiplying net usage by cases prevented, or something). In this case, averaging is great.

This isn't what Bio Anchors is doing, which is fine, because there are other good reasons to perform a superficially similar estimate. If you're uncertain about a higher-up variable in your causal graph, you might want to consider different options and average over them.

Unfortunately, this only works if the generator which produces these estimates is functioning correctly! If you're estimating e.g. which malaria drug the UN will roll out, as a higher-up node in estimating malaria funding needs, then this is doable and you can expect your estimates here to be sane.
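As a toy illustration of that difference, with entirely made-up numbers: averaging five independent, honestly-noisy estimates of the same log-space quantity cancels noise, but if one generator is uncoupled from reality, the average inherits its error:

```python
import random
import statistics

random.seed(0)
TRUE_LOG = 5.0  # true value is 10^5; we work in log10 space


def averaged_estimate(one_generator_broken: bool) -> float:
    # Five estimates, each off by ~0.3 OOMs of honest noise.
    estimates = [random.gauss(TRUE_LOG, 0.3) for _ in range(5)]
    if one_generator_broken:
        estimates[0] += 5.0  # one generator is off by 5 OOMs
    return statistics.mean(estimates)


honest_errs = [abs(averaged_estimate(False) - TRUE_LOG) for _ in range(10_000)]
broken_errs = [abs(averaged_estimate(True) - TRUE_LOG) for _ in range(10_000)]

# Honest case: noise averages out to a small residual error.
print(round(statistics.mean(honest_errs), 2))
# Broken case: the bad anchor drags the average ~1 OOM off, every time.
print(round(statistics.mean(broken_errs), 2))
```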
If not, then, well…

The Nematode Thing

One of Ajeya Cotra's anchors is "all of the computing power used by all animals during the evolution of humanity", which might sound like a reasonable upper bound.

This turns out to be over 90% nematodes. Scott thinks the impressiveness of the calculation is cool. I think this kind of thing is pretty damning for the project as a whole, and the impressiveness of the calculation actually makes it worse, not better.

To me, the obvious implication is: "Oh right, obviously the amount of computation used by individual nematodes has no bearing on how humans evolved, so this must be meaningless." Instead, Ajeya left it in.

Scott also argues it doesn't matter, because "The highest and lowest anchors cancel out, so that the most plausible anchor – human brain with time horizon of hours to days – is around the average. If you remove all the other anchors and just keep that one, the model's estimates barely change."

To give a ridiculous and hyperbolic comparison[3]: imagine you were Prime Minister of some nation, and your chief Economist estimated +2.5% growth in GDP next year, with a standard deviation of 1 percentage point.

Then you looked at the 10th and 90th percentiles of the estimate, and the 10th percentile was -100% because of the return of Christ, bringing the human race to a paradise free of toil, and the 90th percentile was +10,000,000% because of aliens arriving and sharing their advanced technology.

You would not be very satisfied if your Chancellor defended the Economist by saying "Well, these cancel out and the more reasonable estimates remain": you would have serious doubts about the remaining 80 percent of the probability mass!

You would probably fire your Economist (unless they had a really good justification for putting 10% probability each on first contact and the second coming, or something) and get a new report written.

I think Yudkowsky was correct to have extremely serious doubts about the remainder of the estimates. The process generating the arguments had become uncoupled from reality in an obvious way, which should have made us suspicious about the rest of it.

Contra Alexander on Compute and Paradigms

Alexander says:

Yudkowsky further explained that his timelines were shorter than Bio Anchors because people would be working hard to discover new paradigms, and if the current paradigm would only pay off in the 2050s, then probably they would discover one before then. You could think of this as a disjunction: timelines will be shorter than Cotra thinks, either because deep learning pays off quickly, or because a new paradigm gets invented in the interim. It turned out to be the first one. So although Yudkowsky's new paradigm has yet to materialize, his disjunctive reasoning in favor of shorter-than-2050 timelines was basically on the mark.

I think this is an inaccurate way of thinking about things. AI progress has sped up due to the development of test-time compute scaling. GPT-5 is probably[4] less than or around 1 OOM bigger than GPT-4 in terms of training compute, while previous GPT-to-GPT scalings have been ~2 OOMs.

This is exactly the kind of thing that Yudkowsky was talking about! It explains a huge amount of the speed-up. It doesn't quite feel like a paradigm shift the way going from RNNs to transformers did, but it does change the scaling dynamics, the one thing Ajeya was relying on.[5]

The Final Score

The proof of the pudding is in the eating. The proof of a forecast is in whether it was correct. Ajeya's was bad. Eliezer's was much better. Yes, forecasting some things is hard, and maybe there really are cases where we should say "Martine did a great job, but got really unlucky." Scott even talks about this here. But Ajeya was wrong, and Eliezer was right, and that report probably made the EAs' policies worse. Perhaps that's all that should be said. Despite the negative effects of the report, I don't really begrudge Ajeya.
The EAs in 2020 were EAs ascendant, EAs triumphant over Pestilence and rushing to the war against AI risk. I think of them and I think of a WWI cavalry regiment charging forth from the glories of the 19th century into the barbed wire and artillery fire of the 20th.

Perhaps the EAs will realize their efforts on these new battlefields are doomed and they need better epistemics, or perhaps they won't. I would bet on the latter.

Coda

I'll end with a warning about how one discusses one's interlocutors. Imagine two boxers. One of them says in the pre-match interview, "My opponent is a lily-livered yellowbelly, who suckles his thumb and cries to his ma when he loses!" The fight happens, and that boxer goes down in the first round. Now he's the guy who lost to a lily-livered yellowbelly. Rough.

In the original Bio Anchors analysis, Scott says this:

Ajeya's report is a 169-page collection of equations, graphs, and modeling assumptions. Yudkowsky's rebuttal is a fictional dialogue between himself, younger versions of himself, famous AI scientists, and other bit players. At one point, a character called "Humbali" shows up begging Yudkowsky to be more humble, and Yudkowsky defeats him with devastating counterarguments.

Four years ago, this made Ajeya seem pretty sensible. Now… well, Ajeya is the person whose 169-page collection of equations, graphs, and modeling assumptions was outdone by a fictional dialogue where Yudkowsky epically owns a fictional character called "Humbali" who shows up and begs him to be more humble. Rough.

[1] It is over, AI optimist. I have already written a ten thousand word story where your views are loudly shouted by Moronicus and mine are explained calmly by Smartior.

[2] At least it seems that way at the moment.

[3] This comparison is ridiculous and hyperbolic. I am not saying that Ajeya's work is as insane as this imaginary economist.

[4] Claude argues:

Claude GPT-5 Compute Estimate

Total Training Estimate: ~5×10²⁵ FLOPs

Epoch AI estimates GPT-5 used approximately 5×10²⁵ FLOPs total, including both pre-training and reinforcement learning: more than twice GPT-4's ~2×10²⁵ FLOPs, but less than GPT-4.5's >10²⁶ FLOPs (Substack).

This is fascinating because GPT-5 actually broke the scaling trend: every previous GPT model had used roughly 2 orders of magnitude more compute than its predecessor. Had GPT-5 followed that pattern, it would have needed ~10²⁷ FLOPs.

Architecture: Trained From Scratch

GPT-5 was trained from scratch as a natively multimodal model on multiple modalities simultaneously, rather than building on GPT-4o or other pre-existing models (Wikipedia). So it's not using GPT-4o as a base: it's a fresh pre-training run, likely on a mid-sized model (~100B active parameters, possibly MoE with 1-2T total params based on pricing/speed profiles).

The Compute Breakdown

Pre-training: ~3×10²⁵ FLOPs
- Trained on 30-40T+ tokens (possibly higher given OpenAI's investment in pre-training)
- Standard transformer compute: C ≈ 6ND (where N = parameters, D = tokens); activation recomputation under gradient checkpointing adds roughly another forward pass, giving C ≈ 8ND

RL Post-Training: 2×10²⁴ to 6×10²⁵ FLOPs (10-200% of pre-training)

This is where it gets really interesting. In early 2025, RL compute was typically just 1-10% of pre-training, but this scaled rapidly: OpenAI increased RL by 10× from o1 to o3 (Substack). The uncertainty about GPT-5 is whether they:
- Kept RL modest (~10-20% of pre-training) → ~3-6×10²⁴ FLOPs
- Matched pre-training compute (~100%) → ~3×10²⁵ FLOPs
- Exceeded pre-training (~200%) → ~6×10²⁵ FLOPs

For o1-style reasoning models, RL can account for 40% or more of overall compute (Interconnects). Reports suggest GPT-5's training "wasn't straightforward", with lots of experimentation, which points toward the middle-to-high end of that range.
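The C ≈ 6ND rule of thumb can be sanity-checked in a few lines. The parameter and token counts below are the footnote's speculative figures, not confirmed numbers:

```python
# Back-of-envelope training-compute estimate using C ≈ 6·N·D
# (and ≈ 8·N·D with activation recomputation under gradient checkpointing).
# N and D are assumed values from the estimate above, not confirmed figures.

N = 100e9   # assumed ~100B active parameters
D = 35e12   # assumed ~35T tokens (midpoint of the 30-40T range)

standard_flops = 6 * N * D        # ≈ 2.1e25 FLOPs
with_recompute_flops = 8 * N * D  # ≈ 2.8e25 FLOPs, i.e. the ~3×10²⁵ above

print(f"{standard_flops:.1e} vs {with_recompute_flops:.1e}")
```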
[5] Previously, if you got 100× more compute, you'd train a 10× bigger model for 10× as many steps, and use 10× as much compute during deployment, because the model was just bigger. Now, roughly, you can train a 10× bigger model for 10× as many steps, and use 10× as many test-time tokens on that 10× bigger model (in theory, kinda) to scale your training and deployment compute by 100× each, instead of 100× and 10× as before.
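A toy sketch of that regime shift, with illustrative multipliers only (100× compute, 10× test-time tokens; the numbers are for exposition, not measurements):

```python
import math

def old_regime(compute_x: float) -> dict:
    """Pure pre-training scaling: C× compute splits sqrt(C)× model, sqrt(C)× steps."""
    s = math.sqrt(compute_x)
    return {
        "model_size_x": s,
        "steps_x": s,
        "train_compute_x": s * s,   # = C
        "deploy_compute_x": s,      # bigger model, same number of output tokens
    }

def new_regime(compute_x: float, test_time_tokens_x: float) -> dict:
    """Test-time compute: deployment also scales with extra test-time tokens."""
    s = math.sqrt(compute_x)
    return {
        "model_size_x": s,
        "steps_x": s,
        "train_compute_x": s * s,
        "deploy_compute_x": s * test_time_tokens_x,  # model size AND more tokens
    }

old = old_regime(100)
new = new_regime(100, test_time_tokens_x=10)
print(old["deploy_compute_x"])  # 10.0 — deployment tied to model size alone
print(new["deploy_compute_x"])  # 100.0 — test-time tokens add another 10×
```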

