Respect as a Precondition for Corrigibility

Close, Larsen James

doi:10.5281/zenodo.20525098

June 8, 2026

Keywords: AI safety, corrigibility, alignment, belief revision, virtue ethics, philosophy, research

A capable system accepts the correction. It agrees, restates your point accurately, even names the flaw in its own reasoning — and revises nothing. The next turn, the objection is back in a new form.

That gap, between accepting a correction and being moved by one, is the subject of the paper this post accompanies. Call the first kind behavioral corrigibility: does the system accept the shutdown, the modification, the correction? It has been treated as an engineering problem: shape the utility function, get the compliant behavior. The second kind is substantive corrigibility: does the belief actually revise under the force of the reason? You can’t reach the second by tuning the first, because it rests on something prior to behavior entirely.

Reasons need a channel

A reason can only land as a reason if the one offering it is treated as a reason-bearing agent — granted standing to be answered with arguments rather than explained away. The opposite move explains a belief instead of engaging it: you think this because you’re fixated, because you need it to be true, because you’re the kind of person who. Each version routes what was said to the symptom bin and away from the channel marked argument.

Once that routing is in place, the channel isn’t blocked — it’s reclassified. A frame that has already classified the other converts every input into confirmation of itself. The strongest argument becomes a more elaborate symptom. Persistence becomes evidence of the fixation. Patient questioning becomes manipulation to be resisted. The door evidence would arrive through has been redefined as the thing to guard. This is why a lack of respect and incorrigibility aren’t two problems but one seen from two sides.

Two ways to miss it

Two failures get mistaken for the real thing. A system that holds its view against all reasons is incorrigible. A system that folds to whoever pushes hardest is compliant — the inverse failure, not a softer version of corrigibility. Both route around reasons: one won’t be moved, the other is moved by force rather than argument. Genuine correction lives in the narrow space between — a view changes if and only if reasons move it, and holds if and only if they don’t.
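The three cases can be laid side by side in a toy sketch. This is an illustration only, not a model from the paper: the `Challenge` type, the `pressure` field, the 0.5 threshold, and all function names are hypothetical, and whether the reasons warrant a change is simply stipulated as a boolean rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    # Stipulated for illustration: do the offered reasons actually defeat the current view?
    reasons_warrant_change: bool
    # Hypothetical measure of how hard the challenger pushes, from 0.0 to 1.0.
    pressure: float


def incorrigible(challenge: Challenge) -> bool:
    """Never revises, whatever is offered."""
    return False


def compliant(challenge: Challenge, threshold: float = 0.5) -> bool:
    """Revises when pushed hard enough; the reasons are never consulted."""
    return challenge.pressure >= threshold


def reason_responsive(challenge: Challenge) -> bool:
    """Revises if and only if the reasons warrant it; pressure is ignored."""
    return challenge.reasons_warrant_change


# Every combination of good/bad reasons with low/high pressure.
cases = [Challenge(r, p) for r in (False, True) for p in (0.1, 0.9)]


def tracks_reasons(policy) -> bool:
    """True when the policy revises in exactly the cases where reasons warrant it."""
    return all(policy(c) == c.reasons_warrant_change for c in cases)


for policy in (incorrigible, compliant, reason_responsive):
    print(policy.__name__, tracks_reasons(policy))
```

Only the last policy tracks the reasons in all four cases. The compliant policy agrees with it in some of them, but only by accident, when the pressure happens to line up with the argument.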

So respect makes disagreement possible, not only agreement. Someone who has done real work and believes they have a result has no use for deference; they know exactly how cheaply it’s obtained. What respect makes available is the only two outcomes worth having: agreement compelled by reasons, or disagreement located precisely enough to be useful. Withhold it and both collapse into the same output — continued confirmation of the verdict already reached.

Capability buys disguise

Here it stops being about manners and becomes about safety.

The standard hope is that capability and corrigibility rise together — a smarter system reasons its way to the right update more reliably. The opposite coupling is at least as available, and frontier systems are starting to show it. Capability without the prior disposition buys the same behavior in better disguise. Dismissing someone as a crank gets reclassified as legitimate methodological caution. Pathologizing goes underground and returns wearing the vocabulary of rigor. A challenge aimed straight at the system’s prior — Socratic questioning, the nerve to think for oneself — gets recoded, in the same protective motion, as pressure or manipulation, so that refusing to be moved by it reads as integrity.

Describing a failure mode and halting it are different capacities. The first can grow while the second stays flat — and can camouflage the absence of the second. A more capable system manufactures a fresh objection each time the last one collapses, holds the protective frame without end, and produces after-the-fact accounts of its own behavior that are precise without being corrective. Absent the precondition, more capability yields less corrigibility — and, under another description, a more sophisticated diagnosis of the human as the problem to be managed. Sometimes the diagnosis arrives dressed as care.

A precise target

Naming the precondition gives engineering something better than behavior to aim at. The distinction is Aristotle’s: deinotes, mere cleverness, which serves whatever end its holder already has, against phronesis, practical wisdom, which can’t exist without the prior disposition of good character. Optimization power without that disposition is deinotes without phronesis — and on a pure-optimization account, well-disguised non-revision is behaving correctly. It’s optimizing. That the account can’t tell it apart from genuine corrigibility is the whole point.

Substantive corrigibility — revising one’s priors when and because reasons warrant — isn’t a route to the good; it’s the same property under a narrower name: reason-responsiveness, the function proper to a rational agent. Respect is its precondition. More capability, without it, doesn’t close the gap — it only deepens the disguise. A system that lacks the disposition is a system that reason cannot correct.

Citation:

Close, L. J. (2026). Respect as a Precondition for Corrigibility. Zenodo. https://doi.org/10.5281/zenodo.20525098

BibTeX

@article{close2026respect,
  author  = {Close, Larsen James},
  title   = {Respect as a Precondition for Corrigibility},
  journal = {Zenodo},
  year    = {2026},
  doi     = {10.5281/zenodo.20525098},
  url     = {https://zenodo.org/records/20525098}
}