A Fast and Loose Clustering of LLM Benchmarks
AI benchmarks measure a variety of distinct skills, from agency to general knowledge to spatial reasoning. Two benchmarks may measure similar traits if AI models that perform well on one also perform well on the other, and these connections might be nonobvious from the benchmarks' descriptions. This is a rough first pass at clustering benchmarks into groups based upon this type of similarity; the Claude-coded experiment can be found at this github repo.

We have lots of AI benchmarks. They measure things like:

- Coding ability (e.g. SWE Bench, Webdev Arena)
- General knowledge (e.g. GPQA Diamond, SimpleQA, MMLU)
- Mathematical reasoning (e.g. FrontierMath, OTIS Mock AIME)
- Long tasks and agency (e.g. METR Time Horizons, GSO)

So we can compare how good different LLMs and AI agents are by measuring their performance across different benchmarks. This can get messy, but it still gives us a good sense of which models are most capable at any given time.

In fact, with only a little bit of statistical fiddling and optimization (see here for details), Epoch AI integrated 37 benchmarks into their "Epoch Capabilities Index," and this single number seems to capture which models are on top quite well.

[Image courtesy of Epoch AI]

For example, the best models right now are GPT 5.4, Gemini 3.1, and Claude Opus 4.6 (although on NumberInNameBench, OpenAI is outperforming Google and Anthropic by 74% and 17.4% respectively).

Of course, this single number is necessarily quite reductive. At the time of release, the hardest benchmarks in Epoch's index were:

- OS Universe, "a benchmark of complex, multimodal desktop-oriented tasks" [1]
- GSO Bench, "a collection of software performance optimization challenges" [2]
- Factorio Learning Environment, a suite of tasks in the video game Factorio (e.g. "Build an automatic iron gear wheel factory")
- FrontierMath Tier 4, a set of wildly difficult math problems that postdocs and professors spent weeks writing [3]

Why is it that I feel numbers 3 and 4 don't perfectly align… Of course, while all these benchmarks top the charts in difficulty, they are measuring wildly different things. Indeed, one could imagine a huge improvement in computer use and long-term task coherence that leads models to ace problems in Factorio and GUI manipulation while hardly improving performance on cutting-edge mathematics at all.

In the era of unscaffolded chatbots, we might only have needed to measure basic reasoning and general knowledge. As models improved, it became relevant to test their coding ability as well. Now that companies are racing to build AI agents that can automate remote labor, we have a suite of tests measuring computer use and performance on long tasks. Putting all of these in the same bucket feels strange, and many people and companies have already moved on from it. In OpenAI's announcement of GPT-5.4, there are separate sections on coding, tool use, knowledge work, and computer use/vision, each with its own benchmarks.

At first blush, it may seem obvious how to group these benchmarks: if it has SWE in the name it's probably about coding, FrontierMath surely correlates with math olympiad performance, and so on.

Still, I wanted to cluster benchmarks using statistics. I suspected some of the results would be surprising; even to start with, I couldn't guess whether chess puzzles naturally group with coding ability, or math, or computer use. Will playing GeoGuessr land in a group of general knowledge benchmarks, or perhaps with those focused on visual reasoning?
There are certainly limitations to what we can find out, but I came away from my experiment with some fascinating nuggets (chess puzzles are AGI).

Methods

We want to say which benchmarks are similar without manually examining their actual semantic content. That basically leaves us with one other type of information: how well different models perform on these benchmarks. If the same models perform well, and the same models perform poorly, on benchmark A and benchmark B, then it seems likely that the benchmarks are measuring similar things.

We once again use a set of benchmarks, and model evaluations on those benchmarks, from Epoch. You can find the data here. We start with 577 models and 50 different benchmarks.

Let Z_A(m) be the Z-score of model m among all models on benchmark A. We use Z-scores instead of raw benchmark scores because not all benchmarks report scores in the same format, and we don't want correlations between our benchmark data to be dominated by high-variance benchmarks. My first idea was to create a vector v_A for each benchmark A whose mth index, representing the performance of model m on benchmark A, is Z_A(m):

v_A = (Z_A(1), Z_A(2), …, Z_A(M))

Then we can perform standard clustering of our data points in R^M, where M is the number of models. Unfortunately, most of the models are not tested on many of the benchmarks. I tried just plugging in 0 (an average score) for Z_A(m) whenever model m was never evaluated on benchmark A, but this ended up clustering benchmarks based on which models they were tested on more than anything else. Specifically, we end up with benchmarks clustered together based on when they were released, and thus which models were tested on them, rather than with benchmarks of similar content like we intend.

The stopgap fix is to define a distance between any two benchmarks without embedding them as vectors. We let the distance between benchmarks A and B be the average absolute difference in Z-scores across all models that are tested on both A and B.
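As a concrete sketch of these two steps (this is my own minimal reimplementation, assuming the scores arrive as a benchmarks-by-models array with NaN for missing evaluations; it is not the code from the repo):

```python
import numpy as np

def benchmark_distances(scores):
    """scores: (n_benchmarks, n_models) array of raw scores, with np.nan
    where a model was never evaluated on a benchmark. Returns the matrix of
    average absolute Z-score differences over each pair's shared models."""
    # Z-score each benchmark across the models actually evaluated on it.
    z = (scores - np.nanmean(scores, axis=1, keepdims=True)) \
        / np.nanstd(scores, axis=1, keepdims=True)
    n = len(scores)
    dist = np.full((n, n), np.nan)
    for a in range(n):
        for b in range(n):
            # Models evaluated on both benchmark a and benchmark b.
            shared = ~np.isnan(z[a]) & ~np.isnan(z[b])
            if shared.any():
                dist[a, b] = np.abs(z[a, shared] - z[b, shared]).mean()
    return dist
```

A pair of benchmarks with no shared models stays NaN, which is why the low-connectivity pruning described next is needed before clustering.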
This way, we can calculate distances between all benchmarks as long as every pair of benchmarks shares at least one model tested on both of them. The Epoch data does not have this property to start with, but after we cut 3 benchmarks with very little data and greedily cut 20 benchmarks with very low connectivity (which tend to be older and/or quite niche), we end up with a set of 27 benchmarks to analyze.

We apply three methods of clustering: hierarchical, KMeans, and spectral. Each method splits the benchmarks into k clusters, and we test k = 2 through k = 5. When k = 2 the clusters are too large to say much across all three methods, so we restrict to k = 3 through 5. The best silhouette score, a measure of how cohesive and well-separated the clusters are, is achieved across methods when k = 4, so that is what I will write about here. You can play with the data yourself at this repo.

Four Clusters

[Figure: The four clusters we get for each of the three methods. The MDS (multidimensional scaling) axes are a way of trying to place the points near their true distances while forced into 2D space.]

1: Economically Useful Work, and also Chess Puzzles

We can see in the figure above that the bottom right section clusters together across all three methods.
That cluster is:

- apex_agents_external, which tests whether the model can complete multi-step tasks across tools and APIs
- gdpval_external, OpenAI's benchmark of model performance on economically helpful tasks
- swe_bench_verified, a human-verified subset of hard SWE problems
- terminalbench_external, where an AI is asked to manage complex tasks using only the command line
- webdev_arena_external, a human-judged benchmark of which model builds a better website from a prompt
- chess_puzzles, a set of chess positions in which the AI must find the optimal move

My best guess for why chess clusters with this group of benchmarks targeted more at agency and economically valuable work is that chess puzzles, and the rest of this group, are loaded on multi-step reasoning. Perhaps models which are good at considering the repercussions of some action excel at both chess and breaking into secret files. [4]

2: Hard, but Not Research-Level, Reasoning Benchmarks

This is the cluster in the top right of the figure, which consistently contains the following five benchmarks:

- epoch_capabilities_index, a unified measure of model performance across benchmarks
- gpqa_diamond, a set of questions on biology, chemistry, and physics at a level of PhD expertise
- math_level_5, a set of the hardest competition math problems from the AMC and AIME
- otis_mock_aime_2024_2025, competition problems which are even harder than math_level_5, but not yet research level
- geobench_external, which measures how good models are at GeoGuessr

Why is geobench here? One would think that GeoGuessr relies on multimodality and vision ability more than anything else, and this is backed up by the fact that Gemini consistently scores quite well on it.
This might just be noise, but if anyone has theories inspired by this cluster, feel free to leave a comment.

3: Factual Accuracy and Information Synthesis

- deepresearchbench, where models are tasked with gathering specific information from varied sources of data
- simpleqa_verified, a set of questions with factual answers (e.g. "On what day, month, and year was Algerian artist Mohammed Racim born?")
- lech_mazur_writing, a rubric-graded writing benchmark meant to be "a proxy for real-world writing assistance quality" [5]

Once again, there is an outlier in the group. Why does a writing benchmark fit in with two benchmarks that so clearly test the ability to state facts? Well, the Lech Mazur rubric emphasizes "controllability, content planning, and fluency over longer outputs," so perhaps this sort of writing assessment relies more on correctly gathering data and avoiding hallucination than on style or creativity. Indeed, a major part of the grading is the inclusion of a laundry list of required elements, which makes this cluster a little more cogent. See an example in the figure below.

[Figure, from here]

4: The Leftover Frontier

Our final cluster is the least unified.
It has several benchmarks on high-level math and reasoning:

- hle
- frontiermath

It also contains several benchmarks which are meant to be easy for humans but hard for models:

- arc_agi_1 and arc_agi_2, which ask models to follow patterns of how grids are filled in
- vpct, where a model predicts which bucket a ball dropped onto some ramps will fall into
- simplebench, a set of tricky multiple-choice questions that humans find somewhat easy but models can mess up due to extraneous information and red herrings

The remaining bunch seem to focus on long-term reasoning and agency, much like group 1:

- gso, the aforementioned "collection of software performance optimization challenges"
- metr_time_horizons_external, which measures the performance of models on tasks that take humans hours or more
- swe_bench_bash, where models perform hard SWE tasks with only access to the terminal
- weirdml_external, a set of machine learning coding tasks that don't follow common patterns of coding tasks

We can't naively solve this confusing group by clustering into 5 instead of 4; this cluster does indeed split, but HLE ends up apart from FrontierMath, and GSO and METR Time Horizons split from swe_bench_bash and weirdml. I don't have a better theory for why this is a cluster except that it's the "leftovers," and that these are all modern, difficult benchmarks which stump old models.

Grain of Salt

There are very real limitations to this method of grouping benchmarks. Most notably, for many benchmark pairs, only a few models are tested on both. This makes our distance metric noisy, and having data for more benchmark/model pairs could make a difference. One way to test the legitimacy of the clusters in the future would be to perturb the distances between benchmarks, say as a function of the variance of their difference across models, and see if the same clusters remain.

Moreover, we get different results with our different clustering methods.
We tried to focus here on the groupings which were robust across the three methods, and as such we did not end up discussing all of the benchmarks that were actually analyzed.

This was, more than anything, an experiment. While the results are interesting, I don't have high confidence that they are built to last. Still, I think there should be more of a concerted effort to develop quantitative methods that group benchmarks by the distinct types of AI capabilities they track. We've seen again and again that performance on a few cherry-picked benchmarks can be gamed. On the other hand, when we scroll through a table of everything from research-level mathematics to GeoGuessr scores, we risk losing track of what exactly it is that we are measuring.

[1] https://agentsea.github.io/osuniverse/
[2] https://epoch.ai/benchmarks/gso
[3] https://epoch.ai/frontiermath/tiers-1-4/about
[4] https://www.tbench.ai/registry/terminal-bench-core/head/crack-7z-hash
[5] https://epoch.ai/benchmarks/lech-mazur-writing

