Opinion

Toward a Better Evaluations Ecosystem

Model evaluations are broken. Numbers that are often cited alongside one another as evidence of progress are rarely comparable due to inconsistent methodologies, and AI companies run and report internal evals that are unavailable to the wider community. But we can fix this.

We are making deployment and safety decisions based on numbers that do not mean what people think they mean. Every other high-stakes industry has solved this the same way: by taking measurement out of the hands of the companies being measured and handing it to third-party auditors.

Capability and safety benchmarks are important for informing release decisions, risk reports and safety frameworks. They show when models are reliable enough to produce quality work or scientific contributions. Benchmark results are also an important tool for tracking progress over time on a fixed dataset. But most people do not read system cards or the precise methodology used to calculate the headline statistic. They see a number, compare it to the last number, and post about it as definitive evidence that we're one step closer to AGI. The problem is that the numbers they are comparing were often produced under fundamentally different conditions.

The Problem

I realized the extent of these differences when I wrote OpenAI's blog post detailing significant issues with SWE-bench Verified, where we determined it was no longer a meaningful signal for software development capabilities.

Anthropic has changed its methodology in nearly every release since Claude 3.7. For that model, Sonnet was evaluated on 489 of the 500 tasks in the dataset and given three tools: a bash tool, a file editing tool and a planning tool. For Claude 4, Anthropic ran the entire dataset, eliminated the planning tool and turned off reasoning; the same setup seems to have carried over to Opus 4.5, which was averaged over 5 trials. Then, with Opus 4.6, Anthropic enabled reasoning and averaged results over 25 trials. For Mythos and Opus 4.7, the SWE-bench Verified methodology changed again, reporting the average over 5 trials with default sampling settings and maximum reasoning effort.

OpenAI has been more consistent, but its results also show how changes in setup influence the final numbers. Since o3-mini, SWE-bench Verified results have been reported on a 477-task subset of the full dataset, meaning the two companies' results have been inherently incomparable. The results for o3-mini were pass@1 averaged over 4 trials, and this remained consistent, whenever the methodology was detailed, until the benchmark was retired.

Google was light on details with the release of Gemini 2.5, but subsequent releases have included a dedicated document on methodology. Gemini 3 results for SWE-bench Verified were averaged over 10 trials and run using the public API with a bash tool, a file operations tool and a submit tool. Google appears to have used the entire dataset.

[Figure] Table 2.3.A from the Claude Opus 4.6 System Card with SWE-bench Verified results. None of the numbers in this chart (except for Opus 4.5 and Sonnet 4.5) are comparable.

Other benchmarks suffer from the same repeated changes in methodology. Anthropic switched between averages of 5 and 10 trials when reporting GPQA for Sonnet 4.6 and Opus 4.5. OpenAI changed the number of rollouts used for majority voting on AIME from 64 to 8 between o1 and o3 (both without tools), then seemingly eliminated consensus entirely by GPT-5 and added tools. A footnote beneath the chart of per-model bars states: "AIME results with tools should not be compared directly to the performance of models without tool access; they are an example of how effectively GPT‑5 leverages available tools." But many of OpenAI's reported results don't include the number of trials or how the eval was run.
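To make concrete why these aggregation choices matter, here is a minimal sketch of the two schemes mentioned above: pass@1 averaged over k independent trials, and majority voting (cons@k). This is my own illustration, not any lab's actual harness, and run_model is a hypothetical stand-in for a full graded rollout.

```python
from collections import Counter
import random


def run_model(task: str, seed: int) -> str:
    """Hypothetical stand-in for one full eval rollout.

    A real harness would sample the model, run tools and grade the
    transcript; here we simulate a model that answers each task
    correctly about 75% of the time.
    """
    rng = random.Random(f"{task}:{seed}")
    return "A" if rng.random() < 0.75 else "B"


def pass_at_1_avg(tasks, answers, k):
    """pass@1 averaged over k independent trials.

    Each trial is scored on its own and the per-trial accuracies are
    averaged. Larger k mostly shrinks the error bars on the same
    underlying quantity.
    """
    per_trial = []
    for seed in range(k):
        correct = sum(run_model(t, seed) == answers[t] for t in tasks)
        per_trial.append(correct / len(tasks))
    return sum(per_trial) / k


def consensus_at_k(tasks, answers, k):
    """Majority voting (cons@k): grade only the most frequent answer.

    This is a different quantity from pass@1: it is usually higher,
    and it grows with k, so cons@64 and cons@8 are not comparable.
    """
    correct = 0
    for t in tasks:
        votes = Counter(run_model(t, seed) for seed in range(k))
        correct += votes.most_common(1)[0][0] == answers[t]
    return correct / len(tasks)


tasks = [f"task-{i}" for i in range(200)]
answers = {t: "A" for t in tasks}
print(f"pass@1 over 5 trials:  {pass_at_1_avg(tasks, answers, 5):.3f}")
print(f"pass@1 over 25 trials: {pass_at_1_avg(tasks, answers, 25):.3f}")
print(f"cons@8:                {consensus_at_k(tasks, answers, 8):.3f}")
print(f"cons@64:               {consensus_at_k(tasks, answers, 64):.3f}")
```

All four printed values describe the same simulated model, yet they can differ by many points, and any one of them could plausibly be released as the headline number. A score without its aggregation method attached is close to meaningless.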
All of these differences exist before even addressing custom tools or harnesses, which can change results dramatically. While ideally all models would be run under identical conditions from top to bottom, there is an argument to be made that models are trained to function well with specific tools or in a specific harness.

In practical terms, I understand why methodologies change from release to release. The weeks before a model launch are intense and can be chaotic. Maybe an eval was intended to run for 10 trials, but infrastructure issues invalidated three of them. Everyone else at the company is ready to push the launch button, so instead of rerunning everything, results are reported over five trials. But that doesn't mean we should be complacent.

The Solution

Evaluations reported as part of model releases and system cards should be run with a standardized methodology by a third party. This could be a single entity or, ideally, separate organizations for various specializations: coding, biosecurity, cybersecurity and other economically useful tasks. Domain expertise is valuable for ensuring results are accurate, and specialized auditors could simultaneously create new, rigorously researched evals that are then available to everyone. Benchmarks are quickly being saturated, and building new ones is a complex and expensive endeavor.

AI companies would submit a model checkpoint to be evaluated via an API, at least a week ahead of deployment, and tell the third-party auditor which evals to run. That checkpoint should be extremely similar to what is eventually released to the public, although verifying this is itself a nontrivial problem. The auditor would also commit to publicly releasing the results at launch. If a company wants evaluations run with a specific harness or tools, the auditor could run both the company-specific tooling and a standardized setup, with both numbers reported.

The best outcome would be for external auditor organizations to be nonprofits funded by the companies themselves, governments and outside donors. Importantly, no single funder should be the linchpin of any one entity, and the companies should not hold any board seats.

Some safety evaluations are already run this way, in partnership with organizations like Apollo and METR, but the regime should be extended to all benchmark scores reported in model releases. Epoch AI and Artificial Analysis do some post-deployment evaluations on open source benchmarks, but these are not the numbers most often cited by the wider community. Reports suggest the US government may start auditing models before deployment, although it's unclear exactly what that will look like.

Beyond standardizing methodologies, this regime could allow evals that are currently private to be run across a wider range of models. AI companies could submit internal evals to third-party runners: the eval itself stays private, but results and measurements can be distributed. This would also allow deeper insights across different types of evals, like Anthropic's agentic misalignment evaluations or OpenAI's private split of GDPval.

While the community waits for these organizations to take shape, a short-term solution would be for AI companies to jointly agree on shared methodologies for running open source benchmarks. Eval results from this consensus should be reported for current and deprecated models in a joint report co-signed by participating labs, establishing these as shared standards.
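What would such a shared standard actually pin down? Here is a hypothetical sketch (the schema and field names are my invention, not an existing standard) of a manifest an auditor could publish and version, covering every parameter that has historically drifted between releases:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical auditor-published manifest (illustrative only).

    Two reported scores would be comparable if and only if they cite
    the same spec_version; any change to any field forces a bump.
    """
    spec_version: str       # bumped on every methodology change
    benchmark: str          # e.g. "SWE-bench Verified"
    dataset_revision: str   # content hash of the exact task set
    num_tasks: int          # 477, 489 and 500 are different evals
    trials: int             # number of independent runs
    aggregation: str        # "pass@1-mean", "cons@k", ...
    tools: tuple            # e.g. ("bash", "file_edit")
    reasoning_enabled: bool
    sampling: tuple = ()    # e.g. (("temperature", 1.0),)


# Illustrative values only; the hash and settings are placeholders.
SWE_BENCH_VERIFIED_V1 = EvalSpec(
    spec_version="1.0.0",
    benchmark="SWE-bench Verified",
    dataset_revision="sha256:<pinned-by-auditor>",
    num_tasks=500,
    trials=10,
    aggregation="pass@1-mean",
    tools=("bash", "file_edit"),
    reasoning_enabled=True,
    sampling=(("temperature", 1.0),),
)
```

Under a scheme like this, a reported score would read "X% on SWE-bench Verified under spec 1.0.0." Enabling reasoning, changing trial counts or swapping tools would force a version bump rather than silently redefining what the headline number means.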
The Precedent

There is precedent for this kind of regime. Euro NCAP has provided independent, standardized crash test ratings for new cars since 1997, and while testing is voluntary, manufacturers actively participate because strong safety ratings sell cars. Car companies initially attacked the program, claiming the tests were so severe that no car could achieve a top rating, until Volvo proved them wrong five months later.

Financial auditing follows a similar pattern at a more institutional scale. After the Great Depression exposed how misleading self-reported financials could be, the U.S. eventually established GAAP and, after another round of corporate scandals, created the Public Company Accounting Oversight Board in 2002 to independently oversee audits. In both cases the basic problem was the same: entities reporting their own numbers under their own methodologies, with no way for outsiders to compare results across companies.

Standardized methodology does not mean frozen methodology. Euro NCAP has updated its tests repeatedly over nearly three decades, but changes are clearly documented and ratings from different eras are explicitly marked as incomparable. That is the minimum bar AI evaluations should meet.

There may be little incentive for AI companies to voluntarily join this regime. New, big numbers fuel hype cycles, and currently everyone benefits from the ambiguity of self-reporting. But that ambiguity only holds as long as everyone opts out. The moment one company submits to independent evaluation, it gets to frame itself as the transparent one, and every competitor's refusal becomes conspicuous. That dynamic may be enough.

A stronger lever could be government or enterprise procurement policies requiring that models be evaluated by a third party before large, lucrative contracts can be signed. Large buyers already run their own eval suites; this system would simply make the process more efficient for everyone.

Standardized evaluations are only part of how a model's capabilities are ultimately judged. Many people have their own set of tasks they give every new model. This is insightful in its own way, but it lacks the rigor and comparability of a standardized benchmark run under controlled conditions. We need high confidence in model capabilities in order to make informed safety and deployment decisions, and that rigor is currently lacking.

I would like to thank Mia Hopman, Pablo Bernabeu Pérez, James Aung and Dani Balcells for feedback on early drafts of this post.
