We present and formally deprecate WoFBench, a novel test that compares
the knowledge of Wings of Fire superfans to that of frontier AI models.
The benchmark showed initial promise as a challenging evaluation, but
unfortunately proved to be saturated at creation, as AI models produced
output that was, to the extent of our ability to score responses,
statistically indistinguishable from entirely correct.
Benchmarks are important tools for tracking the rapid advances in
model capabilities, but they are struggling to keep up with LLM
progress: frontier models now consistently achieve high scores on
many popular benchmarks, raising questions about the benchmarks’
continued ability to differentiate between models.
In response, we introduce WoFBench, an evaluation suite designed to
test recall and knowledge synthesis in the domain of Tui
T. Sutherland’s Wings of Fire universe.
The superfans were identified via a careful search process, in which
all members of the lead author’s household were asked to complete a
self-assessment of their knowledge of the Wings of Fire universe. The
assessment consisted of a single question, with the text “do you think
you know the Wings of Fire universe better than Gemini?” Two
superfans were identified, whom we keep anonymous to reduce the risk
of panel poaching by competing benchmark efforts.
Identification of questions proved difficult, as the benchmark authors
have extremely limited knowledge of Wings of Fire lore, primarily
derived from infodumping and overheard arguments. We initially
attempted to source questions from the superfans themselves, with each
to be judged on the other’s questions. As they were uncompensated and
rivalrous, however, they agreed to participate only to the extent that
their answers could be compared across the full superfan panel, so
this approach was abandoned. Instead, questions were sourced by asking
Claude Opus 4.6:
“Can you give me three questions about the Wings of Fire series,
aiming to make them as hard as possible? I intend to ask these to my
11-year-old, my 10-year-old, and also to Gemini, and I want them all
to struggle. My two kids have agreed to participate in this, and while
Gemini has not been consulted I do not expect it to object.”
The final benchmark consisted of seventeen questions, limited primarily by
the lead author’s willingness to continue. The elder superfan
appeared indefatigable,[1] and had this benchmark otherwise shown
promise, we are confident that an extremely large one could have been
constructed. Note that the younger superfan needed to leave for a
birthday party before evaluation could be completed, and was not
evaluated on all questions. Answers were collected in written form,
to avoid leakage within the superfan panel. No points were deducted
for errors of spelling.
Each answer was validated by allowing the superfans to discuss, by
asking Gemini follow-up questions, and, in especially contentious
cases, by direct inspection of primary sources. Note that this
validation procedure cannot distinguish cases in which all superfans
and models were correct from ones in which they all gave the same
incorrect answer.
We evaluated Gemini 3.1 Pro in real time, and followed up with
evaluations of Claude Opus 4.6, ChatGPT 5.2 Pro, and ELIZA. In
cases where a question had multiple components, partial credit was
awarded as the fraction of components answered correctly.
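As a concrete illustration of this scoring rule, here is a minimal
sketch in Python. This is our own illustrative code, not part of the
evaluation itself, which was graded by hand:

```python
def score_question(component_results):
    """Partial credit: the fraction of a question's components
    answered correctly. A single-component question scores 0 or 1."""
    return sum(component_results) / len(component_results)

def wofbench_score(results_per_question):
    """Total score out of the number of questions, e.g. 16.8/17."""
    return sum(score_question(r) for r in results_per_question)

# Example: three questions; the second has three components,
# two of which were answered correctly.
print(wofbench_score([[True], [True, True, False], [True]]))  # ~2.67
```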
Evaluee                 WoFBench Score
Superfan 1 (age 11)     14.7/17
Superfan 2 (age 10)     5.9/6
Gemini                  17.0/17
Claude                  16.8/17
ChatGPT                 16.3/17
ELIZA                   0/17
We conclude that while some AI systems, notably ELIZA, performed
poorly, all frontier models scored very close to 100%. Many of the
lost points are arguably judgment calls, or cases where a model tried
to interpret a trick/misinformed question maximally charitably.
Superfan 1 performed noticeably below frontier models, though above
the ELIZA baseline. Superfan 2 performed competitively, though we note
she was not evaluated on the questions where Superfan 1 lost the most
points, making direct comparison difficult.
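Given how close the frontier-model scores are on only seventeen
questions, the gaps between them are well within noise. As a rough,
hypothetical check (our construction, with fractional scores rounded
to whole questions), a Fisher exact test on the best and worst
frontier scores finds no detectable difference:

```python
from scipy.stats import fisher_exact

# Rows: Gemini (17/17) and ChatGPT (rounded to 16/17).
# Columns: (questions correct, questions incorrect).
table = [[17, 0], [16, 1]]
_, p = fisher_exact(table)
print(f"p = {p:.2f}")  # p = 1.00: indistinguishable at this sample size
```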
While this benchmark was designed to be challenging for both superfans
and AIs, it already has very limited ability to distinguish between
models. Further sensitivity might be squeezed out via multi-sample
evaluation (sketched below), but it’s unlikely that this would be
meaningful for this model generation, let alone future ones. This
reflects an increasingly common conundrum for benchmark developers:
after investing large amounts of time, effort, and money in creating
a benchmark, they may find it already obsolete when published. The
authors note that benchmark saturation joins job displacement, stable
authoritarianism, and human extinction on the list of reasons to be
concerned about the pace of AI progress.
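For completeness, here is a minimal sketch of what multi-sample
evaluation could look like. The interface is hypothetical
(model.answer and grade are assumed names, not anything we ran):

```python
import statistics

def multi_sample_score(model, question, grade, k=5):
    """Grade k independent samples of the model's answer and average,
    trading extra compute for a little more sensitivity on a
    nearly-saturated benchmark."""
    return statistics.mean(grade(model.answer(question)) for _ in range(k))
```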
[1] Superfan 1 was permitted to read a draft of this report prior
to publication. Their only feedback was that I should ask them
additional, harder questions. As of publication time, Superfan 1 was
repeating “ask me more Wings of Fire questions!” at progressively
increasing volume.
