
Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index)

This is a crosspost from the General-Purpose AI Policy Lab research blog.

The “Rosetta Stone for AI Benchmarks” paper, by Epoch AI and Google DeepMind researchers, which underpins the Epoch Capabilities Index, gave us a great way to rank AI models and benchmarks on a common difficulty scale. But the resulting “capability score” is hard to interpret (what does a score of 2.54 mean?). We extended the framework to include human baselines.

TL;DR

- The Rosetta Stone framework produces relative capability scores for AI models, but these scores lack a real-world anchor (though they do correlate with METR Task Time Horizons, giving some indirect grounding).
- We integrate human performance baselines (ranging from crowd workers to PhD-level domain experts and top performers) directly into the Rosetta framework, giving the capability scale concrete human references.
- Some benchmarks have been specifically designed to be easy for humans but hard for AIs, which doesn’t match the assumption of a single axis of capability/difficulty. We performed the analysis both with and without these benchmarks.

Main Results:

- Restricting to technical and scientific benchmark skills, current frontier models have crossed Average Humans (late 2022), Skilled Generalists (early 2024), and Domain Experts (2025) on the human expertise spectrum.
- Future models are forecasted to reach Top-Performer level by October 2027 (95% CI: May 2027 – March 2028). This timeline should be taken with a grain of salt, given that benchmarks remain imperfect proxies and human baseline data is sparse.
- One bottleneck is that human performance data is inconsistently collected across difficulty levels. We need standardized, cross-difficulty human baselines to make this kind of calibration more robust and meaningful, and harder benchmarks to better estimate the human performance ceiling.

Context: Comparing AI Progress to Human Performance

Short Introduction to the Rosetta Stone

The pace of AI development is often reported through its technical components: more parameters, more FLOPs, or higher scores on specific benchmarks. Our work builds directly on the Rosetta Stone framework (closely related to the Epoch Capabilities Index), which follows this approach: the original paper introduced a method to estimate the capability of models based on benchmark difficulties. The key graph (Figure 1 in the original paper, reproduced below) shows that these estimated capabilities broadly match user intuition regarding model rankings. At a glance, the graph gives an idea of the relative ranking of models and benchmarks across time. This scoring method offers a form of “universal” ranking, allowing us to compare the capability of models that were never tested against one another, as well as the relative difficulty of benchmarks such as a PhD-level chemistry exam versus a set of cybersecurity-related tasks. Benchmarks (pink) that have yet to be saturated appear naturally above the cloud of models (teal).

Figure 1 from A Rosetta Stone for AI Benchmarks. Estimated model capabilities and benchmark difficulties over time. 0 corresponds to the difficulty of the WinoGrande benchmark.

Adding Human Baselines to Rosetta

Knowing that Gemini 2.5 Pro has a “Capability of 2.54” is not really meaningful on its own. The Rosetta Stone paper proposes to interpret model scores more quantitatively by looking at relative differences between models and mapping their capabilities to task time horizons.
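For intuition, the capability and difficulty scores can be read through an item-response-style lens: a model’s expected score on a benchmark grows with the gap between its capability and the benchmark’s difficulty. The sketch below is a minimal illustration of that relationship, assuming a logistic link with illustrative slope and guessing-rate parameters; it is not the paper’s exact parameterization.

```python
import numpy as np

def expected_score(capability, difficulty, slope=1.0, guess_rate=0.0):
    """Expected benchmark accuracy as a logistic function of the gap between
    model capability and benchmark difficulty (illustrative parameters only)."""
    p = 1.0 / (1.0 + np.exp(-slope * (capability - difficulty)))
    return guess_rate + (1.0 - guess_rate) * p

# A model at capability 2.54 saturates an easy benchmark (difficulty 0)
# but sits near 50% on one whose difficulty matches its capability.
print(expected_score(2.54, 0.0))  # ~0.93 with these illustrative parameters
print(expected_score(2.54, 2.5))  # ~0.51
```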
We build on this by anchoring these numbers to human expertise levels as real-world reference points, using human baseline scores from the literature. The Rosetta framework is built on the assumption that a common capability factor underlies all tasks, and that models can be placed on that difficulty dimension accordingly. By treating human expertise groups as “models” within the Rosetta database, we can calibrate this axis. This allows us to make concrete statements about where a model sits relative to, say, a PhD-level specialist.

A Single Axis?

Only approximately. Model-to-model comparison reveals differences beyond just capability (see Benchmark Scores = General Capability + Claudiness), and this effect is reinforced here given the jagged frontier of AI performance relative to human capabilities. Humans show a predictable hierarchy of skills: if a person can solve a PhD-level chemistry problem, we can safely assume they can also answer a common-sense question about temporal or spatial scenarios. AI models do not share this hierarchy and routinely invert this ordering.

This creates a structural problem for the Rosetta framework:

- Certain benchmarks (like HellaSwag or ARC-AGI) were designed specifically to be trivial for humans but difficult for AI, while advanced technical benchmarks (GPQA Diamond, FrontierMath) are hard for both.
- When we introduce benchmarks that are human-easy but AI-hard, we break the assumption of a unified difficulty axis.
- Because the current Rosetta implementation doesn’t yet support multi-dimensional capability axes, and many human baselines are available for these common-sense benchmarks, they make average humans appear far more “capable” than they are in technical domains.

The Rosetta Stone authors acknowledge this limitation. A natural extension of their framework would incorporate multiple difficulty axes. For now, we filter for benchmarks where the human difficulty axis and the AI difficulty axis are reasonably aligned: primarily technical and scientific tasks. Eight of the 38 benchmarks are thus removed in the main analysis. We have included the full, unfiltered results in Appendix 3.

Results: Anchoring the Rosetta Stone with Human References

We categorized human baselines into four distinct tiers (see full methodology in Appendix 1), plus an aggregated “Committee” version for expert categories:

- Average Human: Crowd workers (e.g., MTurk) or non-specialized participants.
- Skilled Generalist: Individuals with advanced education but not in the target domain (e.g., PhD students in unrelated fields, skilled professionals).
- Domain Expert: PhD-level specialists in the relevant domain or expert professionals.
- Top Performer: Elite performers (e.g., Fields Medal mathematicians, top 5% test takers, or best reported result).
- Committees: Aggregated majority votes or average team scores for the above groups.

Figure 2 — Evolution of AI capabilities and benchmark difficulties compared to human levels (technical and scientific competencies). Horizontal dashed lines indicate calibrated human capability thresholds.

Calibrated human scale

When technical and scientific benchmarks are isolated to align human and AI difficulty, an expected hierarchy emerges:

Group | Estimated Capability
Average Human | 0.55
Skilled Generalist | 1.54
Domain Expert | 2.54
Committee of Domain Experts | 2.97
Top Performer | 4.53

The Skilled Generalist, with a capability score of 1.54, sits just below Claude 3 Opus. Domain Experts land at 2.54, equal to Gemini 2.5 Pro.
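Each human tier’s position on this scale comes from treating the tier like a model: given benchmark difficulties already estimated by Rosetta, find the capability value that best reproduces the tier’s observed scores. The sketch below is a minimal illustration of that idea, reusing the hypothetical expected_score link from the earlier sketch with placeholder numbers; it is not the authors’ exact fitting code.

```python
from scipy.optimize import minimize_scalar

def fit_tier_capability(observations, slope=1.0):
    """observations: (benchmark_difficulty, observed_human_score) pairs.
    Returns the capability that best fits the scores under the logistic link
    defined by expected_score() in the earlier sketch (least squares, for illustration)."""
    def loss(capability):
        return sum((expected_score(capability, d, slope) - s) ** 2
                   for d, s in observations)
    return minimize_scalar(loss, bounds=(-2.0, 8.0), method="bounded").x

# Placeholder example: a tier scoring 0.70 on a difficulty-2.0 benchmark
# and 0.40 on a difficulty-3.0 benchmark lands somewhere between the two.
print(fit_tier_capability([(2.0, 0.70), (3.0, 0.40)]))
```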
Both the Committee of Domain Experts and Top Performers score above GPT-5, currently the strongest model in the Rosetta database at 2.81.

Current Standings: Models vs. Experts

As of 2026, frontier models have effectively crossed the thresholds for Average Humans (late 2022), Skilled Generalists (early 2024), and Domain Experts (2025). Current frontier models are reaching for the Committee of Domain Experts level: they are performing almost on par with aggregated professional teams on well-defined technical tasks.

Two of the hardest benchmarks are GSO-Bench and FrontierMath, which respectively test software optimization (improving code runtime efficiency against expert developer solutions) and research-grade mathematics (problems that often require hours of expert collaboration to solve). These benchmarks approach real-world professional standards, although in simplified settings, indicating that while current models are reaching expert committees, elite individual performance remains out of reach for now.

Note: These benchmarks are simplified proxies, not direct measures of real-world professional competence. Scoring at “Domain Expert level” on a benchmark does not mean models can replace domain experts in their actual work. Real-world tasks involve ambiguity, social context, judgment under uncertainty, and prolonged iterative work that benchmarks deliberately strip away. What these results do show is that models are increasingly capable of solving the kinds of technical, closed-ended problems experts solve, under controlled conditions, which is already remarkable.

Projections: When will AI surpass the Expert Threshold?

Linear extrapolations of the frontier models’ capabilities (the top three performers at any given release date) provide a time estimate for expected parity. Note that these projections assume a continuation of current scaling trends and do not account for other variables:

- Crossover with Average Human: December 2022.
- Crossover with Skilled Generalist: February 2024.
- Crossover with Domain Expert: May 2025.
- Crossover with Top Performer: This baseline represents a significant jump in performance, and in the projection, AI models intersect it around October 2027 (95% confidence interval: May 2027 – March 2028, not accounting for uncertainty on the Top Performer threshold itself).

Figure 3 — Projection of frontier AI capabilities toward human performance levels (technical and scientific competencies). The pink band represents the 95% confidence interval.

Motivation for More Data

The most immediate bottleneck in this analysis is data sparsity; most human groups have only 2 to 4 data points each, and coverage drops sharply outside a group’s primary expertise domain. We need cross-difficulty mapping of human performance for distinct sets of skills (e.g., technical, common sense). Concretely:

- Experts on easy and medium benchmarks, to establish a proper ceiling.
- Average participants on hard benchmarks, to anchor the lower end of the capability scale on difficult tasks.
- Consistent coverage across skill categories (e.g., technical, common sense, visual reasoning).

Limits of Human Baselines

Unlike model evaluations, human scores come from different studies with wildly varying sample sizes (thousands of crowd workers vs. a handful of experts), incentives, time limits, and tool access. This noise compounds with ceiling effects on easy benchmarks and missing cross-difficulty data for each expertise tier.
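To give a rough sense of scale for that noise: the sampling error of a reported accuracy shrinks with the square root of the number of participants, so a baseline measured on a handful of experts is far less certain than one measured on thousands of crowd workers. A back-of-the-envelope illustration using a simple binomial approximation (the numbers are hypothetical, not taken from the studies we cite):

```python
import math

def accuracy_standard_error(accuracy, n):
    """Approximate standard error of an observed accuracy measured on n
    independent participants/items: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Hypothetical numbers: the same 70% accuracy measured on 1,000 crowd workers
# versus 5 experts comes with very different uncertainty.
print(accuracy_standard_error(0.70, 1000))  # ~0.014
print(accuracy_standard_error(0.70, 5))     # ~0.21
```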
Wei et al. (ICML 2025) make a closely related point: existing human baselines in model evaluations are neither sufficiently rigorous nor sufficiently well-documented to robustly support human vs. AI comparisons. Their recommended checklist offers a useful standard for better-structured human metadata.

Figure 4 — Recommended checklist by Wei et al. (ICML 2025) for collecting standardized human baselines in model evaluations.

Limits of the single-axis model

The second major limitation is the single-axis model itself. The Rosetta framework assumes one underlying difficulty dimension, but the jagged frontier problem suggests multiple axes: technical/ground-truth-based knowledge, common sense, fluid reasoning, cultural priors, etc. For example, Burnell et al. (2023), Maia Polo et al. (2025), Kipnis et al. (2025), and Ruan et al. (2024) find low-dimensional latent skills such as reasoning, language modeling, and instruction following. Extending Rosetta to handle multiple difficulty axes would let us represent and understand AI capabilities more faithfully. The authors acknowledge this as a natural next step, and we are currently working on this multi-axis extension of the difficulty scale.

Future work

Beyond these two limitations, a few other directions seem worth exploring:

- Expanding benchmark coverage into professional domains (e.g., finance, law, medicine, computer science) where real-world human performance data exists. Breaking results down by domain (math, coding, biology, etc.) would also reveal where frontier models have actually crossed expert-level performance versus where the aggregate score flatters them.
- Harder and more realistic benchmarks. Models are saturating existing benchmarks faster than new ones appear. We need benchmarks like the Remote Labor Index to represent professional tasks under real constraints that can still discriminate at the frontier.
- Finer human expertise tiers. Distinguishing, for example, a PhD candidate from a researcher with years of experience would require more consistently labeled data, but would meaningfully clarify the upper end of the human scale.
- Committee and collaboration scores. Individual expert baselines are not a ceiling for what humans can achieve. Systematic data on small-group performance would give a more honest upper bound for human capability, especially relevant as AI systems are increasingly compared to teams rather than individuals.

Appendix 1: Full Methodology

We are building on Rosetta by adding human baseline groups on the same coordinate system, treating them as pseudo-models in the database (a minimal sketch of this pseudo-model treatment appears below). Human baselines are derived from scores reported in the literature on the same benchmarks used to evaluate AI models (see Appendix 2). We categorized these into four distinct tiers, plus an aggregated “Committee” version for expert categories:

- Average Human: Crowd workers (e.g., MTurk) or non-specialized participants.
- Skilled Generalist: Individuals with advanced education but not in the target domain (e.g., PhD students in unrelated fields, skilled professionals).
- Domain Expert: PhD-level specialists in the relevant domain or expert professionals.
- Top Performer: Elite performers (e.g., Fields Medal mathematicians, top 5% test takers, or best reported result).
- Committees: Aggregated majority votes or average team scores for the above groups.

This four-tier structure naturally arises from the existing literature; researchers generally report scores in minimally defined clusters.
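To make the pseudo-model treatment concrete, the sketch below appends human tier rows to the same kind of model-by-benchmark score table that feeds the Rosetta fit. The column and tier identifiers are hypothetical; the human baseline values shown are taken from Appendix 2, while the model score is a placeholder.

```python
import pandas as pd

# A minimal sketch (hypothetical column names) of how human tiers are appended to the
# model-by-benchmark score table as "pseudo-models" before re-running the Rosetta fit.
model_scores = pd.DataFrame([
    {"model": "gemini-2.5-pro", "benchmark": "GPQA Diamond", "score": 0.84},  # placeholder value
])

human_baselines = pd.DataFrame([
    {"model": "human::domain_expert",      "benchmark": "GPQA Diamond", "score": 0.812},
    {"model": "human::skilled_generalist", "benchmark": "GPQA Diamond", "score": 0.219},
    {"model": "human::average_human",      "benchmark": "MMLU",         "score": 0.345},
])

# The downstream capability fit treats these pseudo-model rows exactly like AI models.
combined = pd.concat([model_scores, human_baselines], ignore_index=True)
print(combined)
```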
In practice, a significant number of technical benchmarks are assessed against either PhD-level specialists or PhD-level individuals from unrelated fields. Another large share of the benchmark literature reports scores from crowd workers or non-specialized participants, broadly categorized as Average Humans. Above these three tiers, a small number of exceptional scores sit above the typical expert level without belonging to a clearly distinct credential category. These set the frontier of human performance, likely reflecting a combination of domain mastery, test-taking skill, and familiarity with the benchmark format.

Finer resolution, distinguishing for example a second-year PhD student from a senior researcher, would be useful, but the current literature doesn’t yet provide enough consistently labeled data to support it.

Methodological choices

- Removing deliberately human-easy benchmarks: ARC-AGI, TriviaQA, HellaSwag, OpenBookQA, PIQA, SimpleBench, VPCT, and WinoGrande were removed entirely from the Rosetta database, scores included. Keeping them would distort both capability estimates and benchmark placement for everything else. To accommodate this change, the benchmark-difficulty “anchor” defining the zero was changed from WinoGrande to ScienceQA (which had a difficulty score of 0.776).
- Introducing PRBench Finance: We integrated PRBench Finance as a test case for extending the Rosetta framework beyond its original benchmark set. PRBench Finance reflects professional, real-world finance tasks and allows us to add data for the Committee of Domain Experts baseline.
- Calculating uncertainty: We follow the original paper’s method. The margin of error surrounding the baselines and the models (represented as a 95% confidence interval) reflects the uncertainty inherent in testing on a finite set of benchmarks; it is defined by calculating how far a capability score can be shifted before causing a 5% increase in the model’s loss function.
- Projecting crossover dates: Crossover dates are linear extrapolations from the capability scores of the top 3 frontier models over time, assuming scaling continues roughly as it has (see the sketch below). These are trajectories for building intuition, not precise forecasts.
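As an illustration of the projection step, the sketch below fits a straight line to (release date, capability) points for frontier models and solves for the date at which the fitted line crosses a human threshold. The data points are placeholders; this is not the exact fitting code or dataset behind Figure 3.

```python
import numpy as np

# Placeholder (release date as year fraction, capability) points for frontier models.
dates = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5])
capabilities = np.array([1.2, 1.5, 1.8, 2.1, 2.4, 2.7])

# Linear trend of frontier capability over time.
slope, intercept = np.polyfit(dates, capabilities, deg=1)

def crossover_date(threshold):
    """Date at which the fitted linear trend reaches a given capability threshold."""
    return (threshold - intercept) / slope

print(crossover_date(4.53))  # Top Performer threshold from the calibrated scale
```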
Appendix 2: Human Baselines – Technical & Scientific

Benchmark | Human Group | Score | Source | Information
FrontierMath 2025 | Committee of Domain Experts | 0.35 | FrontierMath (Epoch AI) | Solved collectively across all teams (40 exceptional math undergraduates and subject-matter experts) in four and a half hours and with internet access
PRBench Finance | Committee of Domain Experts | 0.796 | PRBench | Agreement between human experts
GPQA Diamond | Domain Expert | 0.812 | GPQA | Experts who have or are pursuing PhDs in the corresponding domains
GPQA Diamond | Skilled Generalist | 0.219 | GPQA | Highly skilled and incentivized non-experts who have or are pursuing PhDs in other domains
GPQA Diamond | Domain Expert | 0.697 | GPQA Diamond (Epoch AI) | PhD-level experts recruited by OpenAI
GSM8K | Skilled Generalist | 0.968 | GSM8K | Qualified human annotators with at least a bachelor’s degree who have passed a qualification exam
GeoBench | Top Performer | 0.9 | GeoBench | Top player
MATH level 5 | Top Performer | 0.9 | MATH (all levels) | Three-time IMO gold medalist
MATH level 5 | Skilled Generalist | 0.4 | MATH (all levels) | One computer science PhD student
MMLU | Domain Expert | 0.898 | MMLU | Score is an estimation from the authors
MMLU | Average Human | 0.345 | MMLU | MTurk
ScienceQA | Average Human | 0.884 | ScienceQA | MTurk workers with a high school or higher degree who have passed the qualification exam
OSWorld | Skilled Generalist | 0.724 | OSWorld | Computer science college students not familiar with the software
TriviaQA | Average Human | 0.797 | TriviaQA | Human performance level

Human-Easy Baselines

Benchmark | Human Group | Score | Source | Information
ARC-AGI 1 | Average Human | 0.77 | ARC Prize | MTurk
ARC-AGI 1 | Committee of Average Humans | 0.98 | ARC Prize | Human panel (at least two participants solved one or more sub-pairs within their first two attempts)
ARC-AGI 1 | Committee of Skilled Generalists | 0.98 | ARC Prize | STEM graduates
HellaSwag | Committee of Average Humans | 0.956 | HellaSwag | Majority vote of 5 crowd workers (MTurk)
OpenBookQA | Average Human | 0.92 | OpenBookQA | Random human subjects
PIQA | Committee of Skilled Generalists | 0.949 | PIQA | Majority vote of top annotators
SimpleBench | Average Human | 0.837 | SimpleBench | Nine non-specialized humans
VPCT | Average Human | 1.0 | VPCT (Epoch AI) | Three volunteers
WinoGrande | Committee of Average Humans | 0.94 | WinoGrande | Majority vote of crowd workers (MTurk)

Appendix 3: All Benchmarks and Baselines

Including all benchmarks breaks the model’s intuitions immediately: the Average Human baseline appears disproportionately capable, with an estimated capability even higher than the Domain Experts and Skilled Generalists.

Most scores contributing to the Average Human baseline come from common-sense benchmarks (easy for humans, historically hard for AI). Average Humans are therefore tested mostly on tasks that carry a lot of weight in the model’s difficulty scale.
Conversely, Skilled Generalists are almost exclusively tested on difficult scientific benchmarks (e.g., GPQA Diamond) where they naturally score lower than Domain Experts and Top Performers. Because the Rosetta graph measures difficulty relative to AI struggles, the high performance of Average Humans on “AI-hard” common-sense tasks inflates their perceived capability.

Additional experiment: Adding “filler” data

We ran a separate analysis on the full benchmark set, common-sense benchmarks included, while manually adding extrapolated score estimates to observe how the Average Human tier’s position changes relative to other expertise levels: if average humans scored above 90% on a benchmark (e.g., WinoGrande, VPCT), we assigned a score of 100% to Skilled Generalists and Experts for that same task.

These filler data are rough estimates; the exact values are not the point. The goal is to simulate what the human baseline scaffolding would look like with more uniform data coverage.

As a result, most expert groups move back above the Average Human baseline, but the Average Human estimated capability remains higher than it should be. The anomaly is probably still missing data: we have no Average Human scores on hard benchmarks, so the model can’t place this tier correctly. More data in the right structure would likely resolve this.
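For reference, the filler rule described above amounts to a simple augmentation over the human baseline table. The sketch below uses hypothetical column names and is an illustration of the rule, not the exact code we ran.

```python
import pandas as pd

def add_filler_rows(baselines: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """For every benchmark where Average Humans scored above `threshold`,
    add a 100% score for the higher expertise tiers on that same benchmark."""
    easy = baselines[
        (baselines["group"] == "Average Human") & (baselines["score"] > threshold)
    ]
    filler = [
        {"benchmark": row.benchmark, "group": tier, "score": 1.0, "filler": True}
        for row in easy.itertuples()
        for tier in ("Skilled Generalist", "Domain Expert")
    ]
    return pd.concat([baselines, pd.DataFrame(filler)], ignore_index=True)
```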

