
Can AI make advancements in moral philosophy by writing proofs?

Cross-posted from my website.
If civilization advances its technological capabilities without advancing its wisdom, we may miss out on most of the potential of the long-term future. Unfortunately, it’s likely that ASI will have a comparative disadvantage at philosophical problems.
You could approximately define philosophy as “the set of problems that are left over after you take all the problems that can be formally studied using known methods and put them into their own fields.” Once a problem becomes well-understood, it ceases to be considered philosophy. Logic, physics, and (more recently) neuroscience used to be philosophy, but now they’re not, because we know how to formally study them.
Our inability to understand philosophical problems means we don’t know how to train AI to be good at them, and we don’t know how to judge whether we’ve trained them well. So we should expect powerful AI to be bad at philosophy relative to other, more measurable skills.
However, there is one type of philosophy that is measurable, while also being extremely important: philosophy proofs.
Some examples of proofs that made important advances in moral philosophy:

The VNM utility theorem proved that any agent whose preferences satisfy four axioms must have a utility function, and that those preferences amount to maximizing the expected value of that function.
Harsanyi’s utilitarian theorem (see also Harsanyi (1955)[1]), which showed that if individuals have VNM utility functions, and if the Pareto principle[2] holds over groups, then a version of utilitarianism must be true. In particular, utility must aggregate linearly across individuals.
Arrhenius (2000)[3] proved that any theory of population ethics must accept at least one counterintuitive conclusion.
Askell (2018)[4] proved that if four intuitive axioms[5] hold, then it is impossible to compare infinite worlds.
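The first two results above can be stated compactly. This is a standard textbook formulation, not the original papers’ notation:

```latex
% VNM: if preferences \succeq over lotteries satisfy completeness,
% transitivity, continuity, and independence, then there is a utility
% function u such that
p \succeq q \iff \sum_x p(x)\,u(x) \;\ge\; \sum_x q(x)\,u(x)

% Harsanyi: given individual VNM utilities u_1, \dots, u_n and the
% Pareto principle, social welfare must be a non-negative weighted sum:
W = \sum_{i=1}^{n} a_i u_i, \qquad a_i \ge 0
```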

I wrote a proof of my own in GiveWell’s Charity Recommendations Require Taking a Controversial Stance on Population Ethics.
The general pattern with these proofs is that you start from a set of intuitively reasonable axioms and use them to produce a controversial conclusion. Having that sort of proof doesn’t tell you whether you ought to reject one of the axioms or accept the conclusion, but it does tell you that you have to do one of those things.
Not many philosophical proofs have been written. That suggests that they’re difficult to write, or at least difficult to come up with. None of the proofs I listed are particularly complicated from a mathematical point of view—undergraduate math students routinely have to write more difficult proofs than those. The challenging part is identifying the right setup: you have to find a proof that tells you something new.
That’s the sort of thing that AI might be able to do well. AI can churn through ideas more quickly than humans can, and it’s relatively good at working with formal systems.[6]
Modern-day LLMs might be smart enough to come up with useful philosophical proofs; even if not, the first AIs that can write these proofs will not need to be superintelligent.
AI won’t be good at telling you how to move forward after finding an impossibility proof, but it can give you the proof.
Proof of concept
A basic test would be to run a pro-tier LLM with extended thinking for a while to search through possibilities and try to come up with an interesting proof; then have human judges review the resulting proof(s). This test would be relatively easy to conduct; the hard part is judging whether the proofs are interesting.
As an even simpler test, I ran three Claude sessions to generate novel impossibility proofs. In each session I provided some guidance on what I was looking for, and I provided different guidance in each case to try to elicit three distinct results. Below is a quick summary of Claude’s three proofs, along with my assessments. I haven’t carefully verified that these proofs are correct, but they passed a quick sanity check.

First proof: We cannot escape Arrhenius’ impossibility result by introducing moral uncertainty.
My assessment: The concept is somewhat interesting, although to me it’s intuitively obvious that moral uncertainty wouldn’t let us get around Arrhenius’ result.

Second proof: If a pluralist value system cares about both maximizing welfare and mitigating individuals’ most severe complaints (similar to Rawls’ maximin principle), then the pluralist system either violates transitivity, or it can be collapsed onto a single scale.
My assessment: Uninteresting—the definition of “complaint minimization” does all the work in the proof, and the welfare-maximization criterion is irrelevant.

Third proof: Given five reasonable axioms of how an aligned AI agent ought to behave, it is impossible for an agent to simultaneously satisfy all five.
My assessment: Uninteresting—it’s a trivial special case of Sen (1970)[7], which proved that no society can satisfy both Pareto efficiency and liberalism. If no society can satisfy those axioms, then clearly no aligned AI can satisfy them, either.

This was just a quick attempt; more work could perhaps elicit better proofs. Claude had a reasonable understanding of the limitations of its own proofs—it noticed (without additional prompting) that the second proof depended only on the definition of “complaint minimization”, and that the third proof was a special case of a known result.
A next step could be to ask many LLM instances to write dozens of proofs, and then use a manager LLM to filter down to the most interesting ones. At minimum, the manager should be able to filter out proofs that are trivial extensions of known results. With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.
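The generate-then-filter pipeline described above can be sketched as follows. This is a minimal sketch, not a real implementation: `generate` and `judge` stand in for hypothetical LLM calls (one generator instance per seed, one manager model scoring novelty and filtering out trivial extensions of known results).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proof:
    axioms: List[str]
    conclusion: str
    text: str


def generate_and_filter(
    generate: Callable[[int], Proof],   # hypothetical: one LLM instance per seed
    judge: Callable[[Proof], float],    # hypothetical: manager LLM scores novelty in [0, 1]
    n_candidates: int = 30,
    min_score: float = 0.5,
    top_k: int = 3,
) -> List[Proof]:
    """Generate many candidate proofs, then keep only the highest-scoring few.

    Mirrors the proposed pipeline: many generator instances produce
    candidates, and a manager discards the uninteresting ones (trivial
    extensions of known results should score near 0).
    """
    candidates = [generate(seed) for seed in range(n_candidates)]
    scored = sorted(((judge(p), p) for p in candidates),
                    key=lambda sp: sp[0], reverse=True)
    kept = [p for score, p in scored if score >= min_score]
    return kept[:top_k]
```

The manager’s threshold and the number of candidates are knobs: a higher `min_score` trades recall for precision in what reaches human judges.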
Is it risky to train AI on philosophy?
This post was about using pre-existing AI to write philosophy proofs, not about specifically training AI to get better at philosophy. I expect advanced AI to be relatively bad at (most kinds of) philosophy because philosophy is hard to train for.
However, it may be dangerous to train AI to get better at philosophy. My worry is that this would make AI better at persuading us of incorrect philosophical positions, and it would make misalignment harder to catch—precisely because it’s so hard to tell whether a philosophical position is correct.
I don’t have a strong view on how important this is, but I would be remiss if I didn’t talk about potential downsides. To be clear, I’m not proposing that we train AI to get better at philosophy. I’m proposing that perhaps near-future AI could be a useful assistant for writing formal philosophical proofs, and that this may be an important application of AI.
Harsanyi, J. C. (1955). Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility. ↩︎

The Pareto principle states that if outcome A is at least as good as outcome B for every person, and outcome A is better for at least one person, then outcome A is better overall. ↩︎

Arrhenius, G. (2000). An Impossibility Theorem for Welfarist Axiologies. doi: 10.1017/s0266267100000249 ↩︎

Askell, A. (2018). Pareto Principles in Infinite Ethics. ↩︎

The four axioms are “the Pareto principle, transitivity, an axiom stating that populations of worlds can be permuted, and the claim that if the ‘at least as good as’ relation holds between two worlds then it holds between qualitative duplicates of this world pair”. ↩︎

I couldn’t have made this statement in 2023. LLMs used to be bad at formal systems, but they’ve gotten much better. ↩︎

Sen, A. (1970). The Impossibility of a Paretian Liberal. ↩︎
