Opinion

Is ProgramBench Impossible?

ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We've been on a quest for "hard benchmarks" for a while, so it's refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it's impossible!

What is ProgramBench?

ProgramBench tests whether a model can recreate a program from a "clean room" environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it.

How does ProgramBench know if the implementation is correct? It also generates a bunch of unit tests for the program[1]. The re-implementing coding agent doesn't have access to any of those tests. The benchmark considers a task "resolved" only if the implementation passes all of the tests, and "almost resolved" if it passes 95% of them.

Why is this problematic?

Obscure behavior can enter the unit tests without being discoverable from the clean room. An extreme version of this is a backdoor: a program that behaves one way most of the time but totally differently when exposed to a specific string. This wouldn't make a task literally impossible, just incredibly hard in a way that is orthogonal to intelligence.

A backdoor

The model will not reasonably try that string unless it knows about it from the docs or from some sort of "gray box" access / reverse engineering[2]. The authors are aware of this problem and even mention it in the paper:

Could obscure program behaviors be impossible or arbitrarily hard to discover? […] Conceptually, one scenario where behavior is borderline impossible to discover is if an executable supports functionality that is not communicated or documented via any observable channel.
In other words, there is functionality that is not revealed by the README.md, the --help flag's standard output, or any artifacts that could be unveiled by typical exploratory actions.

This seems like a theoretical issue, but does it actually happen?

I think so! One of the tasks in ProgramBench is seqtk, a popular computational biology CLI for processing sequence data. Check it out on ProgramBench here. The program has two sub-commands, hrum and kfreq[3], that are tested for but not documented in the clean room.

The test generation agent knows that these behaviors are not documented

What can we do differently?

ProgramBench is awesome and we can learn a ton from it. Here are some improvements I'd love to see in future benchmarks / iterations:

Downstream unit testing

Instead of autogenerating unit tests to fill any gaps, we can check whether software that depends on the program still works. Errors typically propagate destructively in most software, so downstream unit tests might even be more informative than direct ones[4]. The Claude C compiler does something similar: it compiles the Linux kernel and checks whether it runs.

Weighted testing

Some tests are much more important than others, and that should be reflected in the benchmark. Perhaps we can score them: 100 if it's a really important test, 1 if it's a lightly used and barely documented feature. Instead of checking that everything passes, we could compute some kind of weighted score, which would be more robust to unit-test quality issues.

More quality control

We should mostly test behaviors that the coding agent can "reasonably find". There might not be a clear-cut definition of "reasonably find" that everyone agrees on, but we can definitely do better than what we have now.

Feedback from tests

We can give the agent access to all of the tests, or at least some of them, either in a black-box way ("x tests failed, please fix them") or directly ("We tested behavior x and it failed with this error message").
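A minimal sketch of the black-box variant of that feedback loop, where only the failure count leaks back to the agent (the test runner and "agent" below are toy stand-ins, not ProgramBench's actual harness):

```python
def blackbox_feedback(run_tests, improve, impl, max_rounds=5):
    """Run the hidden suite, report only the failure count, let the
    agent revise -- repeat until green or the round budget runs out."""
    for _ in range(max_rounds):
        failures = run_tests(impl)  # hidden tests; only the count leaks out
        if failures == 0:
            return impl, True
        impl = improve(impl, f"{failures} tests failed, please fix them")
    return impl, False

# Toy stand-ins: the "implementation" is an int, the suite wants it to be 3.
final, ok = blackbox_feedback(lambda impl: abs(3 - impl),
                              lambda impl, msg: impl + 1,
                              0)
```

The hidden tests stay secret, but the agent still gets a signal to iterate against instead of a single pass/fail verdict at the end.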
This is more similar to how software engineers typically work, and it is also how the Claude C compiler was built.

[1] This is an awesome idea that unlocks a bunch of things, and I'm excited to see it developed.
[2] Fuzzing binary targets.
[3] Homopolymer run finder and k-mer frequency / neighborhood counter.
[4] The law of leaky abstractions.
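Circling back to the weighted-testing idea, here is a minimal sketch of what such a scorer could look like. The test names and weights below are invented for illustration; they are not from ProgramBench:

```python
def weighted_score(results, weights):
    """Score a run as the weight-fraction of passing tests, so failing an
    obscure undocumented behavior barely moves the number, while failing
    a core behavior tanks it."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if results.get(name, False))
    return earned / total

# Hypothetical weights: core behaviors at 100, an undocumented quirk at 1.
weights = {"parses_fasta": 100, "trims_reads": 100, "hrum_subcommand": 1}
results = {"parses_fasta": True, "trims_reads": True, "hrum_subcommand": False}
score = weighted_score(results, weights)  # 200/201, about 0.995
```

Under this scheme, missing the undocumented hrum behavior costs half a percentage point instead of flipping the task from "resolved" to "failed".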

