Epistemic status: speculation, with a mix of medium-confidence and low-confidence conclusions.
I argue that corrigibility is all we need in order to make an AI
permanently aligned to a principal.
This post will not address how hard it may be to ensure that an AI is
corrigible, or the conflicts associated with an AI being corrigible to
just one principal versus multiple principals. There may be important
risks from AI being corrigible to the wrong person / people, but those
are mostly outside the scope of this post.
I am specifically using the word corrigible to mean Max Harms’ concept (CAST).
Max Harms writes that corrigibility won’t scale to superintelligence:
A big part of the story for CAST is that safety is provided by wise
oversight. If the agent has a dangerous misconception, the principal
should be able to notice this and offer correction. While this might
work in a setting where the principal is at least as fast, informed,
and clear-minded as the agent, might it break down when the agent
scales up to be a superintelligence? A preschooler can’t really vet
my plans, even if I genuinely want to let the preschooler be able to
fix my mistakes.
I don’t see it breaking down. Instead, I have a story for how, to the
contrary, a corrigible AI scales fairly naturally to a superintelligence
that is approximately value aligned.
A corrigible AI will increasingly learn to understand what the principal
wants. In the limit, that means the AI will increasingly do what the principal wants, without the principal needing to issue instructions. Probably that eventually becomes indistinguishable from a
value-aligned AI.
It might still differ from a value-aligned AI if a principal’s
instructions differ from what the principal values. If the principal
can’t learn to minimize that gap with the assistance of an advanced AI, then I don’t know what to recommend. A sufficiently capable AI ought to be able to enlighten the principal as to what their values are. Any alignment approach requires some translation of values into behavior. Under corrigibility, the need to solve this translation problem arises later than under other approaches. It’s probably safer to handle it later, when
better AI assistance is available.
Corrigibility doesn’t guarantee a good outcome. My main point is that I
don’t see any step in this process where existential risks are reduced
by switching from corrigibility to something else.
Vetting
Why can’t a preschooler vet the plans of a corrigible human? When I try
to analyze my intuitions about this scenario, I find that I’m tempted
to subconsciously substitute an actual human for the corrigible human.
An actual human would have goals that are likely to interfere with
corrigibility.
More importantly, the preschooler’s alternatives to corrigibility suck.
Would the preschooler instead do a good enough job of training an AI to
reflect the preschooler’s values? Would the preschooler write good
enough rules for a Constitutional AI? A provably safe AI would be a good alternative, but achieving that looks like a long shot.
Now I’m starting to wonder who the preschooler is an analogy for. I’m
fairly sure that Max wants the AI to be corrigible to the most
responsible humans until some period of acute risk is replaced by a
safer era.
Incompletely Corrigible Stages of AI
We should assume that the benefits of corrigibility depend on how
reasonable the principal is.
I assume there will be a nontrivial period when the AI behaves
corrigibly in situations that closely resemble its training environment,
but would behave incorrigibly in some unusual situations.
That means a sensible principal would give high priority to achieving
full corrigibility, while being careful to avoid exposing the AI to
unusual situations. The importance of avoiding unusual situations is an
argument for slowing or halting capabilities advances, but I don’t see
how it’s an argument for replacing corrigibility with another goal.
At what point along this path to full corrigibility and full ASI does the principal’s ability to correct the AI decrease? The principal is
becoming more capable, due to having an increasingly smart AI helping
them, and due to learning more about the AI. The principal’s trust in
the AI’s advice ought to be increasing as the evidence accumulates that
the AI is helpful, so I see decreasing risk that the principal will do
something foolish and against the AI’s advice.
Maybe there’s some risk of the principal getting in the habit of always
endorsing the AI’s proposals? I’m unsure how to analyze that. It seems
easier to avoid than the biggest risks. It still presents enough risk that I encourage you to analyze it more deeply than I have so far.
Jeremy Gillen says in The corrigibility basin of attraction is a misleading gloss that we won’t reach full corrigibility because “the engineering feedback loop will use up all its fuel.”
This seems like a general claim that alignment is hard, not a claim that corrigibility causes risks compared to other strategies.
Suppose it’s really hard to achieve full corrigibility. We should be able to see this risk more clearly than we can now once we’re assisted by a slightly-better-than-human AI. With such an AI, we should have an increased ability to arrange an international agreement to slow or halt AI advances.
Does corrigibility make it harder to get such an international
agreement, compared to alternative strategies? I can imagine an effect:
the corrigible AI would serve the goals of just one principal, making it
less trustworthy than, say, a constitutional AI. That effect seems hard
to analyze. I expect that an AI will be able to mitigate that risk by a
combination of being persuasive and being able to show that an arms race carries large risks.
Complexity
Max writes:
Might we need a corrigible AGI that is operating at speeds and
complexities beyond what a team of wise operators can verify? I’d give
it a minority—but significant—chance (maybe 25%?), with the chance
increasing the more evenly/widely distributed the AGI technology is.
I don’t agree that the complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to
inform the principal of the consequences of the AGI’s actions.
A corrigible AI or human would be careful to give the preschooler the
best practical advice about the consequences of decisions. It would work
much like consulting a human expert. For example, if I trust an auto mechanic to be honest, I don’t need to understand his reasoning when he says that better wheel alignment will improve fuel efficiency.
Complexity does limit one important method of checking the AI’s
honesty. But there are multiple ways to evaluate honesty, probably more with AIs than with humans, given our ability to get some sort of evidence from the AI’s internals. Evaluating honesty isn’t obviously
harder than what we’d need to evaluate for a value-aligned AI.
And again, why would we expect an alternative to corrigibility to do
better?
Speed
A need for hasty decisions is a harder issue.
My main hope is that if there’s a danger of a race between multiple
AGIs that would pressure AGIs to act faster than they can be supervised,
then the corrigible AGIs would prioritize a worldwide agreement to slow
down whatever is creating such a race.
But suppose some decisions are too urgent to allow consulting the principal? That’s a real problem, but it’s unclear why
that would be an argument for switching away from corrigibility.
How plausible is it that an AI will be persuasive enough at the critical
time to coordinate a global agreement? My intuition is that the answer is pretty similar regardless of whether we stick with corrigibility or
switch to an alternative.
The only scenario that I see where switching would make sense is if we
know how to make a safe incorrigible AI, but not a corrigible AI that’s
fully aligned. That seems to require that corrigibility be a feature
that makes it harder to incorporate other safety features. I’m
interested in reading more arguments on this topic, but my intuition is
saying that keeping corrigibility is at least as safe as any
alternative.
Conclusion
Corrigibility appears to be a path that leads in the direction of full
value alignment. No alternative begins to look better as the system
scales to superintelligence.
Most of the doubts about corrigibility seem to amount to worrying that a
human will be in charge, and humans aren’t up to the job.
Corrigible AI won’t be as safe as I’d like during the critical path to
superintelligence. I see little hope of getting something safer.
To quote Seth Herd:
I’m not saying that building AGI with this alignment target is a good
idea; indeed, I think it’s probably not as wise as pausing development
entirely (depending on your goals; most of the world are not
utilitarians). I’m arguing that it’s a better idea than attempting
value alignment. And I’m arguing that this is what will probably be
tried, so we should be thinking about how exactly this could go well
or go badly.
I’m slightly less optimistic than Seth about what will be tried, but
more optimistic about how corrigibility will work if tried by the right
people.
