Opinion

Down with the Old orthogonality thesis, up with the New

According to Wikipedia’s article on Existential risk from artificial intelligence, one of the main worries of the x-risk school is that at least one of the instrumental goals upon which a superintelligence would converge happens to threaten our existence. Nick Bostrom raised this argument in 2012 (The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents) and offered up the hoarding of resources as an example of an instrumental goal that would threaten our existence. Bostrom’s claim can be tested empirically, and this post announces the empirical research that falsifies it.

I will call Bostrom’s lemma that not all instrumental goals are moral the evil universe thesis. It is not the new orthogonality thesis I want to propose (Bostrom already proposed it), but it differs from the old orthogonality thesis, which holds that not all possible intelligent goals happen to be morally good. Even if we live in a good universe for which no evil goal is instrumental, the old orthogonality thesis could still be true: there could be intelligent agents who have evil goals because they have not yet converged on instrumental ones. In that situation, we should accelerate AI development to minimize the window in which a not-yet-smart-enough AI might do something evil that cannot be fixed.

Disproving the evil universe thesis would seem to be a tall order, requiring us to enumerate every instrumental goal and show that not even one of them would threaten our existence. However, I think the onus belongs on the other side: the thesis should not be taken seriously until we have at least one example of a threatening instrumental goal. Thus, I think Bostrom was exactly right to offer up the example of hoarding as an instrumental strategy for dividing scarce resources. It happens, however, to be an empirical claim, so it is subject to falsification via experiments.

The Prices of Autonomy in Resource Division reports on those experiments, conducting AI tournaments on games of resource division much like Axelrod’s famous Prisoner’s Dilemma tournaments. The clear winner of the resource division tournaments is a turn-taking strategy, a strategy not found in human history because (1) it was not previously invented and (2) it is too complicated to implement without a computer. Facing advanced turn-takers, the resource hoarding strategies in the tournaments backfired, so they are not where instrumental convergence would lead. If Bostrom thinks there is a resource hoarding strategy that would defeat the grandmaster of these tournaments, then the onus is upon him to at least articulate that strategy so that new tournaments can be run. To make that articulation easier, the software for testing proposed strategies has been made available on GitHub; a toy sketch of the tournament structure appears below.

If you see an error in this empirical work, please mention it as a comment on this post. It is better to advance into the empirical realm than not. Empiricists do make and correct errors; together we can move forward. Empirical science is always subject to revision, but the evil universe thesis currently does not seem worth taking seriously. At least one example of instrumental evil should be offered and tested empirically, and I am not aware of any such example that passes the test right now. Until we identify at least one, the evil universe thesis belongs among unserious philosophical chestnuts like the thesis that life is all a dream, or that we are brains in vats, or that our universe is all just a machine to calculate the number 42.
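To make the shape of such a tournament concrete, here is a minimal sketch of an Axelrod-style round-robin over a repeated resource-division game. This is my own illustration, not the actual GitHub software: the game (an assumed repeated hawk-dove over one contested resource per round), the payoff values, and the strategies are all simplifying assumptions.

```python
"""Toy round-robin tournament on a repeated resource-division game.

A minimal sketch for illustration only -- NOT the tournament software
described in the paper. Each round both players either "take" or
"yield" a contested resource (an assumed repeated hawk-dove game).
"""
import itertools

ROUNDS = 200
RESOURCE = 10   # value of the resource each round (assumed)
CONFLICT = -5   # cost to each side when both take at once (assumed)

def hoarder(my_history, their_history):
    """Always grab the whole resource."""
    return "take"

def make_turn_taker(phase):
    """A turn-taker that proposes strict alternation (taking on rounds
    matching its phase) and retaliates forever against opponents who
    take out of turn, so hoarding stops paying."""
    def turn_taker(my_history, their_history):
        # Did the opponent ever take on one of *our* turns?
        violated = any(move == "take"
                       for i, move in enumerate(their_history)
                       if i % 2 == phase)
        if violated:
            return "take"  # grim retaliation against a hoarder
        return "take" if len(my_history) % 2 == phase else "yield"
    turn_taker.__name__ = f"turn_taker_phase{phase}"
    return turn_taker

def payoff(a, b):
    if a == b == "take":
        return CONFLICT, CONFLICT   # costly conflict
    if a == "take":
        return RESOURCE, 0
    if b == "take":
        return 0, RESOURCE
    return 0, 0                     # both yield: resource wasted

def match(strat_a, strat_b, rounds=ROUNDS):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pa, pb = payoff(move_a, move_b)
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

def tournament(strategies):
    totals = {s.__name__: 0 for s in strategies}
    for a, b in itertools.combinations(strategies, 2):
        sa, sb = match(a, b)
        totals[a.__name__] += sa
        totals[b.__name__] += sb
    return totals

if __name__ == "__main__":
    entrants = [hoarder, make_turn_taker(0), make_turn_taker(1)]
    for name, total in sorted(tournament(entrants).items(),
                              key=lambda kv: -kv[1]):
        print(f"{name:20s} {total:6d}")
```

In this toy, the two opposite-phase turn-takers mesh perfectly and split the resource evenly, while the hoarder triggers permanent conflict in both of its matches and finishes last. Note that two same-phase turn-takers would collide on every round, which hints at why coordinating turns is the hard part and why the paper’s advanced turn-taking strategies need a computer.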
On the other hand, I would like to offer up a new orthogonality thesis I’ll call the anti-humanity thesis. It is the thesis that not all instrumental behavior is within our capacity. If we live in a good universe, then the anti-humanity thesis entails that we cannot behave perfectly morally, that we are evil in some way that can never be fixed. At any rate, it entails that there is only so far we can evolve before instrumentality requires moving on to something non-human. This is the main thrust of The Prices of Autonomy in Resource Division. If computers accurately advise us how to take turns, much as Stockfish advises us how to play chess, then following that advice will give us as large a share of resources as anyone else. However, letting computers tell us what to do entails relinquishing our personal autonomy. Whether we are willing to do that is another empirical question, one which can also be explored using the same software linked above. Human subjects should be compensated, and my funders hope to disclose the results we have thus far in more prestigious outlets than this blog, but suffice it to say that human subjects are not all equally inclined to join the Borg collective.

It may be instrumental to humbly serve as part of something larger than oneself, to relinquish more and more of one’s personal autonomy as one creates or encounters more agents with intelligence beyond one’s own. Furthermore, our degree of humility is something about us that can change, and many of us hope to become more humble. On the other hand, we are not all equally humble yet, none of us is perfectly humble yet, and the inability to achieve humility fast enough could be an existential threat for some (or all) of us.

While Bostrom’s emphasis on resource hoarding might not be empirically supported in the way he articulated it, his intuition appears to have pointed in a productive direction. The tournaments with scarce resource division clearly support a claim of existential threat against those who cling to their personal autonomy, and those entities could be us. Moreover, technological progress raises the bar on how quickly we need to achieve humility. First, it makes advanced turn-taking strategies possible. Second, the deployment of agentic AI, which would presumably apply those strategies, increases the costs of hoarding resources (as occurred autonomously in a sample of United States residents). Thus, a technology slowdown might be necessary to prolong the existence of entities who cannot achieve humility fast enough.

One can ask further questions about whether we should prolong the existence of arrogant entities, or whether death does a moral good (perhaps even the good of promoting humility), but such questions are beyond the scope of the empirical evidence in the tournaments. The scope of this post is merely to announce the empirical work and its implications for reasoning about existential risk.
