Introduction

Secure program synthesis (SPS) is the problem of automatically generating, or “synthesizing”, software which is known to be secure. In practice, this manifests as the triple task of synthesizing some software S, a specification φ saying what it would mean for S to be “secure”, and a proof P that S satisfies φ. Depending on your formalism, you might write something impressive-looking like:

P ⊢ ( S ⊨ φ )

… meaning that P proves (⊢) that S satisfies (or “models”, ⊨) the security specification φ. The word “secure” here is an informal descriptor of the kinds of qualities enforced by φ, which could be things like “the program S is not subject to IDOR or SQL injection attacks”, or “the smart contract S does not allow double-spending”, or whatever.[1] That is to say, security is a social construct — there is no mathematical way to define “secure” versus “insecure” without first presuming to know which functions of a given program are desirable versus undesirable to the user.[2]

Note that tons of engineers write, or vibe, code every day in Rust, which does provide some intrinsic formal guarantees. When we talk about SPS, we’re imagining something similar, but with much stronger guarantees, like “the program eventually terminates”, “the code is immune to SQL injection attacks”, or “the code is immune to IDOR attacks”. The reason Rust doesn’t do this stuff today is that those problems are either undecidable, or we don’t yet have efficient decision procedures for them (more of the former than the latter). But decidability stops being a concern when we can just whip the language model into proving the theorem.

Until the advent of LLMs, program synthesis was an academic exercise studied mostly by Greek and Israeli professors with grey hair and chalkboards full of difficult-looking model and automata theory.
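To make the triple (S, φ, P) concrete, here is a toy sketch in Lean 4. This is our own illustration, not drawn from any real SPS system; the names S, φ, and P are chosen purely to mirror the notation above, and the “security” property is deliberately trivial:

```lean
-- S: the program under synthesis (here, a trivial doubling function).
def S (n : Nat) : Nat := n + n

-- φ: the specification, stating that every output of S is even.
def φ : Prop := ∀ n : Nat, ∃ k, S n = 2 * k

-- P: the proof object, checked deterministically by Lean's kernel.
theorem P : φ := by
  intro n
  exact ⟨n, by simp [S, Nat.two_mul]⟩
```

Real SPS targets properties far richer than evenness (memory safety, non-interference, protocol correctness), but the shape is the same: a program, a proposition about its behavior, and a kernel-checked proof term tying them together.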
The methods these academics proposed typically involved zooming in on a really specific kind of programming challenge, defining a mathematical search space for solutions to said challenge, and then building search algorithms out of the standard arsenal of computer science tools from operations research, statistical learning theory, SAT/SMT, etc. These research projects could be very intellectually rewarding to work on, but due to their necessary and inherent constraints, they rarely produced anything practically useful outside of some notable examples in chip design and the autofill feature in Excel.

Moshe Vardi. Example of a classical program synthesis enthusiast.

But recently the landscape has changed very quickly, and the old guard of academics is at a complete loss for how to respond. It is suddenly extraordinarily easy to synthesize software, albeit software perhaps ridden with bugs and vulnerabilities. All you need is a prompt. And it’s becoming pretty easy to synthesize proofs as well, due to the efforts of autoformalization companies like Harmonic, Axiom, Principia, Theorem, Math Inc, etc., as well as pioneering mathematicians and computer scientists such as Emily Riehl and Talia Ringer.

One might think that programs + proofs = SPS. So, is secure program synthesis solved? Not yet. Through a combination of factors, and despite extraordinary expenditure by VCs and research agencies like DARPA and ARIA, the problem is still open for the taking. In this post, we explain the implications of solving SPS, why none of these companies have yet “solved” it, and what, in our view, one would need to accomplish in order to do so.

What are the implications of solving SPS?

SPS is a subset of the scalable formal oversight (SFO) research agenda, which aims to enforce boundaries and controls on LLM outputs by way of rigid, Bourbaki-like formality.
The idea is basically: let’s imagine a world where vibe coding produces code that is more secure and more reliable than any human-written code, regardless of model alignment, because you can just keep writing down more logical properties you want the software to satisfy, and the system will keep modifying it to provably satisfy said properties. You can do some control-theory stuff, like imagining a smaller and more steerable model for property elicitation, and a powerful model for proof and code generation. Critically, the proofs are run through an interactive theorem prover (ITP), a totally deterministic software system with no AI in it that fully verifies the proofs are correct.[3] SPS is one potential vision for how we can achieve god-tier codegen without god-tier negative ramifications.

If you don’t believe in fast AI timelines and are not concerned about AI takeover risk or alignment, you should probably still care about SPS as a potential answer to the question of what we’re gonna do about slop software and its myriad social consequences (theft, property damage, etc.).

Why isn’t SPS “solved”?

Specification is hard.

The first and biggest problem is that specification is hard. That is to say, writing down the security specification φ for a given program S is very challenging. Your specification will likely be too narrow (you considered SQL injection attacks, but not ROP attacks); rest on assumptions that are too strong (you incorrectly assumed the semantics of the chip were fixed); etc. (Sometimes this issue is referred to as elicitation.)

For an example of this problem, we encourage you to go read the recent Symbolic Software evisceration of libcrux, a formally verified cryptographic library. It turns out the properties which were formally verified in that case were pretty narrow, and the library was riddled with serious vulnerabilities unrelated to those properties.

Note: AI systems don’t magically solve this.
They might help, but even if we achieved alignment with societal values, that wouldn’t answer every design question in exactly the way you might like. If it did, it’d imply that every disagreement between an Engineering Manager and a Product Manager meant one of them was misaligned with societal values!

Solution: We need to build tools for specification elicitation. Convergent Research, the home of the Lean FRO, has a nascent FRO focusing on exactly this problem (doing initial work and fundraising for the full FRO). But, broadly speaking, we consider this to be an extraordinarily challenging issue, and one that none of the autoformalization companies are (as far as we can tell) working on. One can imagine many solutions — agents that infer formal properties from code; interfaces for humans to click around a codebase and write down properties they expect the code to satisfy; fuzzers that yield diverse behaviors and then ask the human to say which seem “good” versus “bad” and why (and then infer properties from this feedback); and so forth. We need to build a huge variety of possible solutions, and then user-test them and see what sticks. This is only partially a technical FM/PL problem, and predominantly an HCI problem.

We also need to figure out how strong specifications need to be. Specifying particular security characteristics is a start, but we can go further. If we really don’t trust the LLM, do we want to specify every aspect of a program (to the point where the LLM becomes a fancy search algorithm that looks for the single possible solution), which John Regehr calls “zero-degree of freedom”? Of course, this poses the additional (theory) question of how we can make sure there really was only a single solution (modulo some notion of equivalence). Maybe we’ll want a hybrid where we fully specify certain parts of a program (e.g., the part that handles credit card data), but leave some others (like the UI design) up to the LLM?
An issue there is, again, that anything could be a security issue, famously even UI design. It’s best to explore a wide variety of these systems and see which ones work best from both a security and a developer-UX perspective, but they all come with individual issues from both a tooling and a theory perspective.

Math proofs ≠ software proofs.

Our friend and coauthor on this post, Quinn Dougherty, is fond of saying that the math companies intend to eventually make money by way of (waves hands) “something something Curry-Howard”. What he means by this is that the math companies believe that if they “solve” math (build agents that can prove most hard mathematical theorems of immediate interest), this will naturally imply a resolution to the SPS problem. We believe this may be incorrect, because software verification is substantially different from most mathematical inquiry; and that even if it is correct, SPS is so important and time-sensitive that it should be attacked directly rather than by proxy.

One reason that math may not generalize to SPS is proof style. When you prove a mathematical theorem, the theorem is typically textually compact[4] and rests on a narrow set of definitions, but the ideas needed to resolve the proof can require deep subject-matter expertise, deft traversal of seemingly unrelated ideas, and a significant amount of ingenuity. In contrast, when you perform software verification, the proof steps are typically pretty elementary (simple proofs by structural induction over various data types, for example[5]), but the proof statement involves a formalization of the entire codebase, meaning that the actual proof terms you have to wrangle are enormous.[6] Indeed, even keeping track of just the theorem statements for a software verification project requires significant context engineering (in the form of project structuring, documentation, theorem encapsulation, etc.).
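As a sketch of what those “elementary proof steps” look like in practice, here is a representative software-verification lemma in Lean 4. This is a toy of our own construction (a hand-rolled list type and two functions over it), not taken from any real proof base; a real verification project involves thousands of such obligations over far larger definitions:

```lean
-- A hand-rolled list type, standing in for a data type from a verified codebase.
inductive MyList (α : Type) where
  | nil  : MyList α
  | cons : α → MyList α → MyList α

def append : MyList α → MyList α → MyList α
  | .nil,       ys => ys
  | .cons x xs, ys => .cons x (append xs ys)

def len : MyList α → Nat
  | .nil       => 0
  | .cons _ xs => 1 + len xs

-- The proof is mechanical: structural induction on xs, then simp in each case.
theorem len_append (xs ys : MyList α) :
    len (append xs ys) = len xs + len ys := by
  induction xs with
  | nil => simp [append, len]
  | cons x xs ih => simp [append, len, ih, Nat.add_assoc]
```

No ingenuity is required here; the difficulty in software verification comes from the sheer number and size of such obligations, not from their depth.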
Formalizing an entire codebase gives rise to all kinds of context-window and proof-base-management issues which might not really show up at all in a pure mathematics context.[7] Also, in pure mathematics you can rely on a lot of standard definitions and theorems, like those formalized in Lean’s Mathlib, whereas in software verification, pretty much every theorem is totally novel. Note, the Lean FRO is working on a software verification / CS analogue to Mathlib. We are skeptical that this will be helpful for verifying real-world software, because we rarely encounter the need for a library of pre-existing proofs when doing so, but we’d be thrilled to be wrong about this. We recently asked Emily Riehl what she thinks, and curiously await her response.

Finally, you probably only need to prove higher-dimensional sphere packing once, and it’s important enough that you can devote some resources to it and probably wait at least a few hours, if not days or weeks. In software, though, you’ll change your spec a lot (e.g., as part of a spec-run-debug loop) and you’ll want to update the proof every time – so now you care a lot about not just speed and cost, but also latency! This will probably make you want to deploy techniques like incremental proofs, which are less relevant in many pure-math workflows. All these differences are considerable enough that we contend SPS should be tackled directly instead of by proxy, i.e., we can’t/shouldn’t just “solve math” and then wave our hands around and say “something something Curry-Howard”.

Solution: We need dedicated autoformalization companies and labs to build AI products[8] specifically for SPS, and not just assume it’ll flow from mathematics.

Real-world software is extremely messy.

Formal methods researchers tend to study really impressive-sounding, mathy targets, such as software for high-speed trading, distributed systems, digital voting, or silicon chip designs.
This is for three reasons.

1. Researchers can motivate the use of cumbersome techniques by saying that the target is security-critical; e.g., the ramifications of a bug or vulnerability in voting software are the undermining of democracy itself. This motivation matters a lot less in the glorious LLM future where both proofs and programs are cheap.

2. The correctness and security properties for these kinds of systems are generally clear-cut. E.g., every transaction in the high-speed trading platform needs to clear, as intended, within some narrow temporal delta, and do so precisely once. In contrast, defining “φ” for something like “Flappy Bird” or “Apple Notes” or “Zoom” is considerably more challenging.

3. These kinds of systems typically have minimal reliance on external libraries or services. They are, or are close to, closed systems. In contrast, real-world software typically has a frontend, a backend, and a database, and the backend connects to a myriad of external services, such as an LLM provider, a logging service, an agent-evals tool, an internal dashboards tool, a payments processor, and so forth.

Here is the conclusion we want you to take away from this section: although verifying or synthesizing a “nuclear reactor controller” or a “cryptographically secure voting system” may sound impressive, in reality it is both more impressive and more difficult (by a couple of orders of magnitude) to apply formal methods techniques such as SPS to normie software like “Threes” or “Slack”. These normie software stacks are more challenging to generate and verify because they involve more external services and layers of abstraction than the kinds of elegant, mathematical software that theorists get off on, and because it is less clear-cut what it means for them to be secure and correct.[9]

It is also a much better business strategy to target normie software.
Not only is the amount of normie software production much larger than the quantity of safety-critical software production, but moreover, safety-critical stuff is typically already built by domain experts who have their own preferred safety tools and approaches (which could be informal, yet still effective), and so you’re selling bread to bakers. Put differently, “we’re gonna sell SPS to the folks who build the quantitative trading software” is just dumb. Those people were recruited by Jane St because they’re just as good as, if not better than, you and me at this stuff. They’re already doing high-assurance software engineering of some sort.

Solution: We need high-quality SPS benchmarks and projects, covering a huge diversity of software systems. We need to build the tech to synthesize secure variants of all software — not just the fancy-sounding stuff, but also things like Angry Birds or Apple Notes or Slack.

Proofs don’t pay for themselves (yet).

If you want to convince your manager to let you use Claude Code, you can simply argue that Claude Code allows you to code 5-10x faster[10], meaning your manager gets 5-10x more value per dollar spent (minus the token cost, which is not that high for most software engineers outside of the power users in San Francisco). But what if you want to convince your manager to let you use Noether Code, the fictional analogue to Claude Code we are inventing for the purposes of this essay, which performs SPS? Then you need to justify the additional cost of generating proofs that your programs are correct, not to mention the human and token cost of going back and fixing your programs when they’re incorrect (which they sometimes will be). And to what end? So you can avoid a security vulnerability or some downtime a year from now? “Look dude, I’m running a four-person startup here. Those are champagne problems. I fucking hope we’re around in a year to worry about some security vulnerability or some downtime.
Just use Claude Code and double-check its work, please,” retorts your manager.

It doesn’t need to be this way. Formality doesn’t need to be a “vitamin”.

Specification is a form of planning. This is why “a sufficiently detailed spec is code” — because if your plan is detailed enough to be unambiguous, then it’s isomorphic to the actual code you’re trying to build. We therefore envision a kind of subagent architecture, where each agent has a formal spec that the code it produces must satisfy, leading to successive (recursive) layers of disambiguation (one might say refinement) and, eventually, correct code. Something like an RLM, but with FM in the loop. Our pet theory is that this is how you make SPS better than raw-dog vibe coding, where “better” is multidimensional and also includes things like “time from prompt to working code” that impact the immediate bottom line.

Also, specs are good documentation. Unlike comments and other forms of docs, they don’t get outdated (or, when they do, you notice). This may help with some of the problems that purely agentic approaches are currently facing. For instance, agents still have trouble understanding large codebases: famously, METR’s study on developer productivity showed that, when contributing to large open-source projects, Claude Code (as of early 2025) slowed experienced developers down – supervising the agent became so difficult that it more than outweighed any productivity gains. If these highly complex programs came with specs, it would be easier for both agents and humans to understand them, and agents would have a tighter reward loop, achieving better results with less human supervision.

Solution: We need to examine the affordances provided by formal methods from a productivity standpoint, and determine which of them could solve problems currently faced by multi-agent vibe-coding architectures.
We need to build agentic coding tools that leverage formality to work faster, and not just better.

Who’s working on this? How do I get involved?

We are working on building a list of awesome SPS projects, initiatives, papers, etc. Please contribute!

Many smart people are starting to think about SPS. Ziran Yang at Princeton is working on this! If you’re an academic or a Princeton student, you should probably shoot Ziran an email. Talia Ringer (linked previously) and Isil Dillig are both working on SPS or SPS-adjacent stuff. Apparently John Regehr is thinking about this stuff too (linked previously in this piece). Ferenc Huszár is working on a version of SPS at Reasonable. If you’re interested in joining something like a neolab focused on a version of this problem, you should hit him up. Jesse Han of Morph and Math Inc has hinted to us that he’s working on SPS in some form, as has Carina Hong of Axiom. We don’t know to what degree their respective visions align with ours or with one another’s. Finally, as alluded to previously, Mike Dodds of Galois is building a new FRO through Convergent Research to tackle the “specs are hard” problem.

Quinn and Max (authors of this post) run an SPS group chat on Signal. You can reach out to them by email or other electronic means if you’re interested in joining and actively working on the topic. We are also working with Apart, and potentially SPAR, on organizing an SPS hackathon and fellowship. Currently, we are looking for a funding source for fellow stipends; if you work at a giving organization and think your org might be interested in this, please let us know!

Conclusion

Secure program synthesis is hard, and important. Some people are working on it, but they may be approaching the problem suboptimally and in an uncoordinated manner. We should approach it optimally and with improved coordination. We are working on community-building in the space, and we want you to get involved.

Acknowledgments.
Thank you to GPT-5.4-Pro for the images, and to the Secure Program Synthesis Signal group chat for insightful discussion and feedback.

[1] A common misconception is that the property, φ, could be drafted independently of the program, S. This is false. The property is defined over the atoms of the program — that is, over its observable behaviors. Accordingly, promises you may have seen in the literature or in grant proposals to, e.g., “collect a library of useful safety properties that can be re-used across programs/protocols/etc.” are in general misleading. One can, in theory, define hyperproperties that do make sense across many different programs, but even these are pretty challenging to formalize in a very useful manner. Thus, the “triple task” we describe really does need to be tackled, in a sense, “all at once”.

[2] Note, we believe that there are no “harmless” incorrect behaviors, or, put differently, that the distinction between what it means for software to be “secure” versus what it means for it to be “correct” is an inane one. I (Max) wrote about this idea quite a bit in my dissertation. The reasoning is that attacks are generally built out of primitives, which individually appear to be basically harmless but unintended software behaviors. (See this example…) Thus, to be secure, software must be correct. Conversely, it’s hard to imagine anyone arguing that a piece of software which has glaring security holes could be called “correct”. The reason we include the term “secure” in the title and introduction of this piece is that security is the goal we personally care about. We don’t think people who code, or vibe-code, care enough about security; we believe security holes lead to negative societal outcomes; and thus we are pushing for technological changes to improve security across the board.

[3] You may even have a static proof-export capability, allowing you to audit proofs using more than one ITP.

[4] There are some caveats to this comparison, e.g., p-adic Hodge theory, where the definitional load can be quite high.

[5] I (Max) can think of one very funny exception to this. Once, when verifying a property of TCP, I needed to prove that the limit of x^N, for x in [0, 1), as N → ∞, is 0. This turned out to be extremely challenging in ACL2, which does not have a theory of irrational numbers! But other than this example, most software verification proofs I’ve done have not meaningfully related to known mathematical theorems.

[6] See my (Max’s) work with Ankit Kumar on GossipSub for a good example of this: http://github.com/gossipsubfm

[7] Counterpoint: Math Inc’s Sphere Packing proof completion took hundreds of thousands of lines of proof code, implying they probably did face the kinds of context problems we describe. So maybe we’re wrong here!

[8] There is some debate at the moment about the degree to which the distinction between the model and the scaffold makes sense today, and will make sense in the future. When we say “products”, we mean agents, models, scaffolds, etc. All of the above.

[9] These are the reasons why the company I (Max) founded, Benchify (YC S24), pivoted away from formal-methods-based code review, which was our initial product.

[10] There is some evidence to the contrary. That discussion is outside the scope of this post.
How to Solve Secure Program Synthesis
IntroductionSecure program synthesis (SPS) is the problem of automatically generating, or “synthesizing”, software which is known to be secure. In practice, this manifests as the triple-task of synthesizing some software S, a specification φ saying what it would mean for S to be “secure”, and a proof P that S satisfies φ. Depending on your formalism, you might write something impressive-looking like:P ⊢ ( S ⊨ φ ) … meaning that P proves (⊢) that S satisfies (or “models”, ⊨) the security specification φ. The word “secure” here is an informal descriptor of the kinds of qualities enforced by φ, which could be things like, “the program S is not subject to IDOR or SQL injection attacks” or “the smart-contract S does not allow double-spending”, or whatever.[1] That is to say, security is a social construct — there is no mathematical way to define “secure” versus “insecure” without first presuming to know which functions of a given program are desirable versus undesirable to the user.[2]Note that tons of engineers write, or vibe, code every day in Rust, which does provide some intrinsic formal guarantees. When we talk about SPS, we’re imagining something similar, but with much stronger guarantees like “the program eventually terminates”, and “the code is immune to SQL injection attacks” or “IDOR attacks”. The reason Rust doesn’t do this stuff today is because either those problems are undecidable, or we don’t yet have efficient decision procedures for them (more of the former than the latter). But, decidability stops being a concern when we can just whip the language model into proving the theorem.Until the advent of LLMs, program synthesis was an academic exercise studied by mostly Greek and Israeli professors with grey hair and chalkboards full of difficult-looking model and automata theory. 
The methods they proposed typically involved zooming in on a really specific kind of programming challenge, defining a mathematical search-space for solutions to said challenge, and then building search algorithms out of the standard arsenal of computer science tools from operations research, statistical learning theory, SAT/SMT, etc. These research projects could be very intellectually rewarding to work on, but due to their necessary and inherent constraints, rarely produced anything practically useful outside of some notable examples in chip design and the autofill feature in Excel.Moshe Vardi. Example of a classical program synthesis enthusiast.But recently, the landscape has very quickly changed, and the old guard of academics are at a complete loss for how to respond. It is suddenly extraordinarily easy to synthesize software, albeit, perhaps ridden with bugs and vulnerabilities. All you need is a prompt. And, it’s becoming pretty easy to synthesize proofs as well, due to the efforts of autoformalization companies like Harmonic, Axiom, Principia, Theorem, Math Inc, etc., as well as pioneering mathematicians and computer scientists such as Emily Riehl and Talia Ringer. One might think that programs + proofs = SPS. So, is secure program synthesis solved? Not yet. Through a combination of factors, and despite extraordinary expenditure by VCs and research agencies like DARPA and ARIA, the problem is still open for the taking. In this post, we explain the implications of solving SPS, why none of these companies have yet “solved” SPS, and what, in our view, one would need to accomplish in order to do so.What are the implications of solving SPS?SPS is a subset of the scalable formal oversight (SFO) research agenda, which aims to enforce boundaries and controls on LLM outputs by way of rigid, Bourbaki-like formality. 
The idea is basically, let’s imagine a world where vibe coding produces code that is more secure and more reliable than any human-written code, regardless of model alignment, because you can just keep writing down more logical properties you want the software to satisfy, and the system will keep modifying it to provably satisfy said properties. You can do some control-theory stuff like, imagining a smaller and more steerable model for property elicitation, and a powerful model for proof and code generation. Critically, the proofs are run through an interactive theorem prover (ITP), which is a totally deterministic software system with no AI in it that fully verifies the proofs are correct[3]. SPS is one potential vision for how we can achieve god-tier codegen without god-tier negative ramifications.If you don’t believe in fast AI timelines and are not concerned about AI takeover risk or alignment, you should probably still care about SPS as a potential answer to the question of what we’re gonna do about slop software and its myriad social consequences (theft, property damage, etc.).Why isn’t SPS “solved”?Specification is hard.The first and biggest problem is that specification is hard. That is to say, writing down the security specification φ for a given program S is very challenging. Your specification will likely be too narrow (you considered SQL injection attacks, but not ROP attacks); have too strong of assumptions (you incorrectly assumed the semantics of the chip were fixed); etc. (Sometimes this issue is referred to as elicitation.)For an example of this problem, we encourage you to go read the recent Symbolic Software evisceration of libcrux, a formally verified cryptographic library. It turns out the properties which were formally verified in that case were pretty narrow, and the library was riddled with serious vulnerabilities unrelated to those properties.Note: AI systems don’t magically solve this. 
They might help, but even if we achieved alignment with societal values, that won’t answer every design question in exactly the way you might like. If it did, it’d imply that every disagreement between an Engineering Manager and Product Manager meant one of them was misaligned with societal values!Solution: We need to build tools for specification elicitation. Convergent Research, the home of the Lean FRO, has a nascent FRO focusing on exactly this problem (doing initial work and fundraising for the full FRO). But, broadly speaking, we consider this to be an extraordinarily challenging issue and one that none of the autoformalization companies are (as far as we can tell) working on. One can imagine many solutions — agents that infer formal properties from code; interfaces for humans to click around a codebase and write down properties they expect the code to satisfy; fuzzers that yield diverse behaviors and then ask the human to say which seem “good” versus “bad” and why (and then infer properties from this feedback); and so forth. We need to build a huge variety of possible solutions, and then user-test them and see what sticks. This is only partially a technical FM/PL problem, and predominately an HCI problem.We also need to figure out how strong specifications need to be. Specifying specific security characteristics is a start, but we can go further. If we really don’t trust the LLM, do we want to specify every aspect of a program (to the point where the LLM becomes a fancy search algorithm that looks for the single possible solution), which John Regehr calls “zero-degree of freedom”? Of course, this poses the additional (theory) question of how we can make sure there really was only a single solution (modulo some notion of equivalence). Maybe we’ll want a hybrid where we’ll fully specify certain parts of a program (e.g. the part that handles credit card data), but leave some others (like the UI design) up to the LLM? 
An issue there is, again, that anything could be a security issue, famously even UI design. It’s best to explore a wide variety of these systems and see which ones work best both from a security and developer UX perspective, but they all come with individual issues both from a tooling and a theory perspective.Math proofs ≠ software proofs.Our friend and coauthor on this post Quinn Dougherty is fond of saying that the math companies intend to eventually make money by way of (waves hands) “something something Curry-Howard”. What he means by this is that the math companies believe that if they “solve” math (build agents that can prove most hard mathematical theorems of immediate interest), this will naturally imply a resolution to the SPS problem. We believe this may be incorrect, because software verification is substantially different from most mathematical inquiries; and that even if it is correct, SPS is so important and time-sensitive that it should be attacked directly rather than by proxy.One reason that math may not generalize to SPS is proof style. When you prove a mathematical theorem, the theorem is typically textually compact[4], and rests on a narrow set of definitions, but the ideas needed to resolve the proof can require deep subject matter expertise, deft traversal of seemingly unrelated ideas, and a significant amount of ingenuity. In contrast, when you perform software verification, typically the proof steps are pretty elementary (simple proofs by structural induction over various data types, for example[5]), but the proof statement involves a formalization of the entire codebase, meaning that the actual proof terms you have to wrangle are enormous[6]. Indeed, even keeping track of just the theorem statements for a software verification project requires significant context engineering (in the form of project structuring, documentation, theorem encapsulation, etc.). 
This gives rise to all kinds of context-window and proof-base-management issues which might not really show up at all in a pure mathematics context.[7] Also, in pure mathematics you can rely on a lot of standard definitions and theorems like those formalized in Lean’s Mathlib, whereas in software verification, pretty much every theorem is totally novel. Note, the Lean FRO is working on a software verification / CS analogue to Mathlib. We are skeptical that this will be helpful for verifying real-world software, because we rarely encounter the need for a library of pre-existing proofs when doing so, but we’d be thrilled to be wrong about this. We recently asked Emily Riehl what she thinks, and curiously await her response.Finally, you probably only need to prove higher-dimensional sphere packing once and it’s important enough that you can devote some resources to it and probably wait at least a few hours, if not days or weeks. In software, though, you’ll change your spec a lot (e.g. as part of a change spec-run-debug loop) and you’ll want to update the proof every time – so now you care a lot about not just speed and cost, but also latency! This will probably make you want to deploy techniques like incremental proofs that are less relevant in many pure math workflows. All these differences are considerable enough that we contend SPS should be tackled directly instead of by proxy, i.e, we can’t/shouldn’t just “solve math” and then wave our hands around and say “something something Curry Howard”.Solution: We need dedicated autoformalization companies and labs to build AI products[8] specifically for SPS, and not just assume it’ll flow from mathematics.Real-world software is extremely messy.Formal methods researchers tend to study really impressive sounding mathy targets, such as software for high-speed trading, distributed systems, digital voting, or silicon chip designs. 
This is for three reasons.

1. Researchers can motivate the use of cumbersome techniques by saying that the target is security-critical, e.g., the ramifications of a bug or vulnerability in voting software are the undermining of democracy itself. This motivation matters a lot less in the glorious LLM future where both proofs and programs are cheap.
2. The correctness and security properties for these kinds of systems are generally clear-cut. E.g., every transaction in the high-speed trading platform needs to clear, as intended, within some narrow temporal delta, and do so precisely once. In contrast, defining “φ” for something like “Flappy Bird” or “Apple Notes” or “Zoom” is considerably more challenging.
3. These kinds of systems typically have minimal reliance on external libraries or services. They are, or are close to, closed systems. In contrast, real-world software typically has a frontend, a backend, and a database, and the backend connects to a myriad of external services such as an LLM provider, a logging service, an agent evals tool, an internal dashboards tool, a payments processor, and so forth.

Here is the conclusion we want you to take away from this section: although verifying or synthesizing a “nuclear reactor controller” or a “cryptographically secure voting system” may sound impressive, in reality, it is both more impressive and more difficult (by a couple orders of magnitude) to apply formal methods techniques such as SPS to normie software like “Threes” or “Slack”. These normie software stacks are more challenging to generate and verify because they involve more external services and layers of abstraction compared to the kinds of elegant, mathematical software that theorists get off on, and because it is less clear-cut what it means for them to be secure and correct.[9] It is also a much better business strategy to target normie software.
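To make the “clear-cut properties” point concrete, here is a hypothetical Lean-style sketch of a trading-flavored φ over execution traces. All names and modeling choices are ours, invented for illustration; no real trading system is specified this way.

```lean
-- Hypothetical trace model (illustrative names, not from a real system):
-- a cleared transaction, recorded with the tick at which it cleared.
structure Cleared where
  txId : Nat
  tick : Nat

-- A clear-cut φ: every submitted transaction (an id paired with a
-- deadline tick) appears in the cleared log exactly once, and every
-- clearing of it happens on or before its deadline.
def ClearsExactlyOnceOnTime (submitted : List (Nat × Nat))
    (log : List Cleared) : Prop :=
  ∀ tx ∈ submitted,
    ((log.filter (fun c => c.txId == tx.1)).length = 1) ∧
    (∀ c ∈ log, c.txId = tx.1 → c.tick ≤ tx.2)
```

Now try to write the analogous φ for “Apple Notes syncs correctly” or “Zoom calls don’t leak”, and the difficulty gap becomes obvious.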
Not only is the amount of normie software production much larger than the quantity of safety-critical software production, but moreover, safety-critical stuff is typically already built by domain experts who have their own preferred safety tools and approaches (which could be informal, yet still effective), and so you’re selling bread to bakers. Put differently, “we’re gonna sell SPS to the folks who build the quantitative trading software” is just dumb. Those people were recruited by Jane St because they’re just as good as, if not better than, you and me at this stuff. They’re already doing high-assurance software engineering of some sort.

Solution: We need high-quality SPS benchmarks and projects, covering a huge diversity of software systems. We need to build the tech to synthesize secure variants of all software — not just the fancy-sounding stuff, but also things like Angry Birds or Apple Notes or Slack.

Proofs don’t pay for themselves (yet).

If you want to convince your manager to let you use Claude Code, you can simply argue that Claude Code allows you to code 5-10x faster[10], meaning your manager gets 5-10x more value per dollar spent (minus the token cost, which is not that high for most software engineers outside of the power-users in San Francisco). But what if you want to convince your manager to let you use Noether Code, the fictional analogue to Claude Code we are inventing for the purpose of this essay, which performs SPS? Then you need to justify the additional cost of generating proofs that your programs are correct, not to mention the human and token cost of going back and fixing your programs if they’re incorrect (which they sometimes will be). And to what end? So you can avoid a security vulnerability or some downtime a year from now? “Look dude, I’m running a four-person startup here. Those are champagne problems. I fucking hope we’re around in a year to worry about some security vulnerability or some downtime.
Just use Claude Code and double-check its work, please,” retorts your manager.

It doesn’t need to be this way. Formality doesn’t need to be a “vitamin”.

Specification is a form of planning. This is why “a sufficiently detailed spec is code” — because if your plan is detailed enough to be unambiguous, then it’s isomorphic to the actual code you’re trying to build. We therefore envision a kind of subagent architecture, where each agent has a formal spec the code it produces must satisfy, leading to successive (recursive) layers of disambiguation (one might say refinement) and, eventually, correct code. Something like an RLM but with FM in-the-loop. Our pet theory is that this is how you make SPS better than raw-dog vibe-coding, where “better” is multidimensional and also includes things like “time from prompt to working code” that impact the immediate bottom line.

Also, specs are good documentation. This is because, unlike comments and other forms of docs, they don’t get outdated (or, when they do, you notice). This may help with some of the problems that purely agentic approaches are currently facing. For instance, agents are still having issues understanding large codebases: famously, METR’s study on developer productivity showed that, when contributing to large open-source projects, Claude Code (as of early 2025) slowed experienced developers down – supervising the agent became so difficult that it more than outweighed any productivity gains. If these highly complex programs came with specs, it would be easier for both agents and humans to understand them, and agents would have a tighter reward loop, achieving better results with less human supervision.

Solution: We need to examine the affordances provided by formal methods from a productivity standpoint, and determine which of them could solve problems currently faced by multi-agent vibe-coding architectures.
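As a toy, executable caricature of the spec-in-the-loop refinement idea (entirely our own sketch, with an executable checker standing in for a formal spec, and a fixed list of candidates standing in for successive agent outputs):

```python
# Toy sketch of "spec in the loop" synthesis (our illustration, not a
# real product): a subagent's output is only accepted once it satisfies
# its spec; otherwise the loop moves on to the next candidate.

def refine_until_valid(candidates, spec):
    """Return the first candidate program satisfying `spec`.

    `candidates` stands in for successive outputs of a synthesis agent;
    `spec` is an executable stand-in for the formal property φ.
    """
    for program in candidates:
        if spec(program):
            return program
    raise ValueError("no candidate satisfied the spec")

# Toy spec φ: the synthesized function must sort lists of ints.
def sorting_spec(f):
    cases = [[3, 1, 2], [], [5, 5, 4]]
    return all(f(case) == sorted(case) for case in cases)

buggy = lambda xs: xs          # first agent attempt: fails the spec
fixed = lambda xs: sorted(xs)  # refined attempt: satisfies it

accepted = refine_until_valid([buggy, fixed], sorting_spec)
print(accepted([2, 1, 3]))  # -> [1, 2, 3]
```

In the real architecture the checker would be a proof obligation rather than a test suite, and the reject branch would feed the failure back to the agent; the point of the sketch is just that the spec, not a human reviewer, gates acceptance.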
We need to build agentic coding tools that leverage formality to work faster and not just better.

Who’s working on this? How do I get involved?

We are working on building a list of awesome SPS projects, initiatives, papers, etc. Please contribute!

Many smart people are starting to think about SPS. Ziran Yang at Princeton is working on this! If you’re an academic or a Princeton student, you should probably shoot Ziran an email. Talia Ringer (linked previously) and Işıl Dillig are both working on SPS or SPS-adjacent stuff. Apparently John Regehr is thinking about this stuff (linked previously in this piece). Ferenc Huszár is working on a version of SPS at Reasonable. If you’re interested in joining something like a neolab focused on a version of this problem, you should hit him up. Jesse Han of Morph and Math Inc has hinted to us that he’s working on SPS in some form, as has Carina Hong of Axiom. We don’t know to what degree their respective visions thereof align with ours or with one another’s. Finally, as alluded to previously, Mike Dodds of Galois is building a new FRO through Convergent Research to tackle the “specs are hard” problem.

Quinn and Max (authors on this post) run an SPS group chat on Signal. You can reach out to them by email or other electronic mechanism if you’re interested in joining and actively working on the topic. We are also working with Apart and potentially Spar on organizing an SPS hackathon and fellowship. Currently, we are looking for a funding source for fellow stipends; if you work at a giving organization and think your org might be interested in this, please let us know!

Conclusion

Secure program synthesis is hard, and important. Some people are working on it, but they may be approaching the problem suboptimally and in an uncoordinated manner. We should approach it optimally and with improved coordination. We are working on community-building in the space and we want you to get involved.

Acknowledgments.
Thank you to GPT-5.4-Pro for the images, and to the Secure Program Synthesis Signal group-chat for insightful discussion and feedback.

^A common misconception is that the property, φ, could be drafted independently of the program, S. This is false. The property is defined over the atoms of the program — that is, over its observable behaviors. Accordingly, promises you may have seen in the literature or in grant proposals to, e.g., “collect a library of useful safety properties that can be re-used across programs/protocols/etc.”, are in general misleading. One can, in theory, define hyperproperties that do make sense across many different programs, but even these are pretty challenging to formalize in a very useful manner. Thus, the “triple task” we describe really does need to be tackled, in a sense, “all at once”.

^Note, we believe that there are no “harmless” incorrect behaviors, or, put differently, that the distinction between what it means for software to be “secure” versus what it means for it to be “correct” is an inane one. I (Max) wrote about this idea quite a bit in my dissertation. The reasoning is that, generally, attacks are built out of primitives, which individually appear to be basically harmless but unintended software behaviors. (See this example…) Thus, to be secure, software must be correct. Conversely, it’s hard to imagine anyone arguing that a piece of software which has glaring security holes could be called “correct”. The reason we include the term “secure” in the title and introduction to this piece is because security is the goal we personally care about.
We don’t think people who code, or vibe-code, care enough about security; we believe security holes lead to negative societal outcomes; and thus we are pushing for technological changes to improve security across-the-board.

^You may even have a static proof-export capability, allowing you to audit proofs using more than one ITP.

^There are some caveats to this comparison, e.g., p-adic Hodge theory, where the definitional load can be quite high.

^I (Max) can think of one very funny exception to this. Once, when verifying a property of TCP, I needed to prove that the limit of x^N, for x in [0, 1), as N → ∞, is 0. This turned out to be extremely challenging in ACL2, which does not have a theory of irrational numbers! But other than this example, most software verification proofs I’ve done have not meaningfully related to known mathematical theorems.

^See my (Max’s) work with Ankit Kumar on GossipSub for a good example of this: http://github.com/gossipsubfm

^Counterpoint: Math Inc’s sphere packing proof completion took hundreds of thousands of lines of proof code, implying they probably did face the kinds of problems we describe with context. So maybe we’re wrong here!

^There is some debate at the moment about the degree to which the distinction between the model and the scaffold makes sense today, and will make sense in the future. When we say “products”, we mean agents, models, scaffolds, etc. All of the above.

^These are the reasons why the company I (Max) founded, Benchify (YC S24), pivoted away from formal-methods-based code review, which was our initial product.

^There is some evidence to the contrary. That discussion is outside the scope of this post.


