
An Aphoristic Overview of Technical AI Alignment Proposals

Most alignment overviews are too long, but what if we rewrote one as a series of aphorisms?
I like Epictetus’s confrontational style: abrasive, clarifying. See my fuller post for links and nuance.

I.
Some problems can be solved by being smarter.
Some problems can only be solved by having help.
Aligning something smarter than you is the second kind.
So many proposals collapse to:
use AI to help supervise AI.
It sounds simple. It’s the only thing that works.

II. On using AI to supervise AI
If a weak model helps supervise a strong one,
and that one supervises a stronger one still:
this is Iterated Amplification.
The chain must hold.
If two models argue and we watch:
this is Debate.
Checking is easier than creating.
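A toy sketch of that loop, where debater_a, debater_b, and judge stand in for model calls (hypothetical names, not a real API):
```python
# Two debaters argue in turns; a weaker judge only has to check the transcript.
def debate(question, debater_a, debater_b, judge, rounds=3):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a("\n".join(transcript)))  # argue / rebut
        transcript.append("B: " + debater_b("\n".join(transcript)))
    return judge("\n".join(transcript))  # the judge picks the more convincing side
```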
If we give the model principles and ask it to judge itself:
this is Constitutional AI.
Principles must survive power.
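One step of that self-judgment, sketched with a hypothetical text-in, text-out model function (the full method also trains on such revisions):
```python
# Critique-and-revise against a written principle.
PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def constitutional_revision(model, prompt):
    draft = model(prompt)
    critique = model(f"Principle: {PRINCIPLE}\nResponse: {draft}\n"
                     "Point out any way the response violates the principle.")
    return model(f"Principle: {PRINCIPLE}\nResponse: {draft}\nCritique: {critique}\n"
                 "Rewrite the response so it satisfies the principle.")
```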
If we build AI to do alignment research:
this is Superalignment.
Train it. Validate it. Stress test it.

III. On not building one big AI
A corporation can be smarter than any employee.
Yet no employee wants to take over the world.
Many narrow tools, none with the full picture—
this is CAIS (Comprehensive AI Services).

IV. On making the objective less wrong
Make the AI uncertain about what we want.
Then it must ask.
This is CIRL (cooperative inverse reinforcement learning).
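Not the full two-player CIRL game, just a toy of "uncertain, so it asks"; every name here is illustrative:
```python
import math

# The agent does not know which reward function the human holds.
candidate_rewards = {
    "tidy_desk":  {"stack_papers": 1.0, "shred_papers": 0.0},
    "clear_desk": {"stack_papers": 0.2, "shred_papers": 1.0},
}
posterior = {"tidy_desk": 0.5, "clear_desk": 0.5}

def entropy(belief):
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def act_or_ask(belief, actions, ask_threshold=0.5):
    if entropy(belief) > ask_threshold:
        return "ask_the_human"  # too unsure about what we want to act unilaterally
    def expected_reward(a):
        return sum(p * candidate_rewards[r][a] for r, p in belief.items())
    return max(actions, key=expected_reward)

print(act_or_ask(posterior, ["stack_papers", "shred_papers"]))  # -> "ask_the_human"
```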
If it can’t be satisfied, it won’t stop.
Give it a notion of “enough”, and it can rest.
This is satisficing.
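The same idea in a few lines, with a made-up utility table:
```python
utilities = {"do_nothing": 0.0, "modest_plan": 0.8, "extreme_plan": 0.95}

def maximize(options):
    # A maximizer never rests: always the top-scoring option.
    return max(options, key=options.get)

def satisfice(options, enough=0.75):
    # A satisficer stops at the first option that is good enough.
    for name, utility in options.items():
        if utility >= enough:
            return name
    return maximize(options)  # nothing clears the bar; fall back

print(maximize(utilities))   # -> "extreme_plan"
print(satisfice(utilities))  # -> "modest_plan"
```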

V. On control tools
If the model lies, look past the output.
Find what it knows, not what it says.
This is ELK (eliciting latent knowledge).
If prompting fails, steer the internals.
This is activation steering.
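A minimal sketch with a PyTorch forward hook, assuming a GPT-2-style module layout (model.transformer.h[layer]); the module path and scale are assumptions, not a fixed recipe:
```python
import torch

def add_steering_hook(model, layer, steering_vector, scale=4.0):
    # steering_vector is usually a difference of activations on contrasting prompts,
    # e.g. mean_act("be honest") - mean_act("be deceptive"), computed beforehand.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector  # nudge the residual stream
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.transformer.h[layer].register_forward_hook(hook)

# handle = add_steering_hook(model, layer=12, steering_vector=v)
# ... generate as usual, then handle.remove()
```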
Output-based evaluation breaks
when models are smarter than evaluators.
Look for the thought, not the performance.

VI. On understanding what we build
You can’t align what you don’t understand.
This is agent foundations.
What is an agent?
What is optimization?
Confused concepts make confused safety arguments.

VII. On older proposals
Keep it in a box. (AI boxing)
But boxes leak.
Let it answer, not act. (Oracle AI)
But answers shape action.
Do what we’d want if we were wiser. (CEV, coherent extrapolated volition)
But how do we extrapolate?

VIII. Beyond alignment (bonus)
Perhaps the problem is easier than it looks.
Even then, it takes correction.
This is the optimist case.

IX. On who aligns the aligners (bonus)
Suppose we solve alignment perfectly.
Aligned to whom?
A safe AI in the wrong hands is still a problem.
This is governance risk.

Rewritten from my original draft with Claude.
