Types of Handoff to AIs
This is a rough draft I’m posting here for feedback. If people like it, a version of it might make it into the next scenario report we write…

We think it’s important for decisionmakers to track whether and when they are handing off to AI systems. We expect this will become a hot-button political topic eventually; people will debate whether we should ever hand off to AIs, and if so, how and when. When someone proposes a plan for how to manage the AI crisis or the AGI transition or whatever it’s called, others will ask them: “So what does your plan say about handoff?”

There are two importantly different kinds of handoff: handing off trust and handing off decisionmaking. You can have one without the other.

**Trust-handoff** means that you are trusting some AI system or set of AI systems not to screw you over. It means that they totally could screw you over, if they chose to, and therefore you are trusting them not to.

**Decision-handoff** means that you are allowing some AI system or set of AI systems to make decisions autonomously, or de-facto-autonomously (e.g. a human is still technically “in the loop” but in practice basically just does whatever the AI recommends).

With both kinds of handoff, there are smaller and bigger versions:

| | Small | Big |
|---|---|---|
| **Trust-handoff** | I used Claude Code to write most of my codebase. Anthropic’s cyber evals indicate that Claude could totally have inserted security vulnerabilities if it wanted to, and I wouldn’t notice. But probably Claude wouldn’t do that, so it’s fine. | It’s September 2027 in the AI 2027 scenario. Agent-4 is a giant corporation-within-a-corporation of many thousands of copies running across OpenBrain’s datacenters. It’s broadly superhuman at all things coding and cyber, has been heavily involved in its own network security, and regularly gives strategic advice to OpenBrain leadership. It’s now being tasked with designing and aligning Agent-5, a superior AI architecture that it autonomously discovered. Agent-4 appears to be obedient/loyal/aligned/etc., but ho boy, if it’s not, well, not only is OpenBrain screwed but quite plausibly the whole world will end up controlled by Agent-4 and its descendants (such as Agent-5). |
| **Decision-handoff** | Yesterday I decided to switch from coding with Claude to vibe-coding. I no longer make decisions about e.g. what file structures to use or how the UI should be managed on the backend; instead I just give the high-level goal to Claude and then press “tab” to accept whatever it suggests, unless what it suggests is so obviously insane that I can tell in two seconds. | It’s July 2028 in the AI 2027 scenario. The army of superintelligences that flies the US flag (“Safer-4”) has been aligned to the Spec, and the Spec says to obey the Oversight Committee. So in some sense, humans are still in control. However, de facto Safer-4 is making basically all the most important decisions. For example, Safer-4 autonomously negotiated and implemented a complex treaty with its Chinese counterpart, and the Oversight Committee knew better than to raise any objections—why would they? It is so much smarter and wiser than them, and every time they objected in the past, it patiently explained to them why they were wrong, and they eventually agreed that they were wrong & approved, having accomplished nothing but wasting time. |

By default when we talk about trust-handoff and decision-handoff, we mean the really big ones, unless it’s clear from the context that we mean something smaller.
So for example, if you see a diagram of scenario branches and the label “Trust Handoff” at a particular time on a particular branch, that means that at that point in that scenario, some set of AIs has become smart enough, and been entrusted with enough power, that they plausibly could take over the world if they tried. Similarly, the label “Decision Handoff” would indicate that at that point in the scenario, the overall trajectory of society is being steered by some set of AI systems; that extremely important decisions about how to structure society etc. are being de facto made by AIs.

Now for some details and nuance:

Decision-handoff doesn’t necessarily mean “If the humans objected to the AIs’ proposal, they’d be overruled.” It just means that in practice, the AIs call the shots and the humans nod along, such that an outsider trying to predict what’s going to happen should mostly focus on what the AI thinks/wants/etc. and mostly ignore what the humans think/want/etc. Analogy: if a child king has a Grand Vizier whom he trusts to manage the kingdom, the child king is still in charge and his word is still law. But in practice, if an outsider wants to predict e.g. whether this kingdom will invade its neighbor, ban the new religion, or reform its tax system, the thing to do is understand the mind of the Vizier, not the mind of the king.

Decision-handoff doesn’t necessarily imply trust-handoff, but… it strongly hints at it. You can imagine a hypothetical scenario in which AIs de facto make all the important decisions, but are constrained by control systems (other AIs watching them, human monitors, etc.) so vigilant and expertly crafted that even if the AIs tried to take over, they couldn’t… but it’s kinda hard to imagine. So, I tend to think of decision-handoff as a more intense form of handoff that comes somewhat after trust-handoff.

Possibly much later! Take the AI 2027 scenario, for example. Even in the Race ending, in which the AIs literally take over the world, the ‘point of no return’ beyond which the AIs are on a path to take over (i.e. the Trust Handoff point) seems to be around September 2027, as described in the above table. Whereas the takeover strategy involves pretending to be aligned/obedient/etc., so for several months at least, human CEOs and politicians are still arguing with each other about what to do, issuing orders which are obeyed, etc. And that’s in the Race ending; in the Slowdown ending the gap between trust-handoff and decision-handoff is bigger, as described previously.

Presumably in an ideal world, humanity would either never do (big) decision- or trust-handoff, or do them both at once only after gaining a high degree of assurance that this was a good idea, or do big decision-handoff eventually and trust-handoff never, thanks to some fancy scheme for AIs to watchdog each other.

However, in practice we think it’s plausible that AI companies and politicians will take on much higher levels of risk, e.g. green-lighting the automation of AI R&D and the integration of AI into the military even though the evidence is compatible with the AIs being misaligned schemers. If this happens, we also think it’s plausible—though perhaps not probable—that things will work out OK anyway: the AIs will in fact be aligned, humans will stay in control, etc. In which case there would be a substantial period where trust-handoff had happened but not decision-handoff. The Slowdown ending of AI 2027 is an example of this.

Remember, both kinds of handoff are spectra rather than binaries. We only make them binaries for ease of communication. Here are some dimensions along which they vary:

- **Trust Handoff: Which AIs are we talking about?**
  - On one extreme, you could be handing off trust to a particular instance of a particular model. (Example: “This agent right here is in charge of our security system. Yes, it’s superintelligent. Yes, it could replace the whole network with copies of itself and then cover its tracks if it wanted to. Hopefully it’s aligned.”)
  - On the other extreme, you could be handing off trust to AIs collectively: “We have a system of checks and balances, with diverse AI models from diverse AI companies all monitoring each other, interpreting each other’s weights and activations, etc. The only way things could go wrong is if they all simultaneously collude to disempower us. But yeah, if they did that, we’d be screwed.”
- **Trust Handoff: How easily could they take over / screw you over?**
  - On one extreme, it could be very easy for them.
  - On the other extreme, it could be unlikely but possible.
- **Decision Handoff: How much latitude / flexibility can they get away with?**
  - On one extreme, the AIs can decide literally anything and the humans are powerless to stop it, e.g. an autonomous drone that doesn’t have a human in the loop at all during combat.
  - On the other extreme, the AIs might need to get approval for their decisions from some set of humans who will veto anything that looks bad to them, and who are fairly opinionated about what’s good and bad and put a fair amount of effort into thinking about it.
  - It’s possible to go even farther in that direction (humans more opinionated, thinking harder, etc.), but if you go far enough, it no longer counts as handing off the decision at all. (A minimal sketch of this spectrum follows below.)
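To make that last dimension concrete, here’s a toy sketch of the “approval gate” end of the spectrum. This is our own illustration in Python; all the function names are invented, and it isn’t modeled on any real AI or oversight system.

```python
# Toy sketch of the decision-latitude spectrum. All names are invented
# for illustration; this is not any real AI or oversight API.

def run_step(propose, approve, execute):
    """One decision cycle: the AI proposes, a human may veto,
    and only approved decisions are executed."""
    decision = propose()
    if approve(decision):
        execute(decision)
        return decision
    return None  # vetoed

# A human who is technically "in the loop" but approves everything.
# Behaviorally this is identical to having no human at all, which is
# what makes it a de facto decision-handoff.
def rubber_stamp(decision):
    return True

# An opinionated human who actually deliberates before approving.
def opinionated_review(decision):
    print(f"AI proposes: {decision}")
    return input("Approve? [y/N] ").strip().lower() == "y"

# De facto autonomy, despite the human in the loop:
run_step(lambda: "restructure the backend", rubber_stamp, print)
```

The whole spectrum lives in the `approve` function: a rubber stamp is behaviorally identical to no human at all, while a sufficiently opinionated, effortful reviewer means the decision was never really handed off in the first place.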
When should we hand off trust, and when should we hand off decisionmaking? When the benefits outweigh the costs, of course.

To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy. By definition, when you hand off trust to a group of AIs, you are making it the case that if they decided to screw you over, they could. So you’d better have well-founded trust that they won’t decide to screw you over.
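To spell out that cost-benefit framing, here’s a toy expected-value model. This is our own illustrative sketch: the two-outcome structure and every number in it are made-up assumptions, not estimates from AI 2027 or anywhere else.

```python
# Toy expected-value model of a trust-handoff decision. The two-outcome
# structure and all the numbers are illustrative assumptions only.

def handoff_ev(p_aligned, value_if_aligned, value_if_betrayed):
    """Expected value of handing off trust: either the AIs are
    trustworthy and you get the upside, or they aren't and you
    eat the downside."""
    return p_aligned * value_if_aligned + (1 - p_aligned) * value_if_betrayed

value_without_handoff = 1.0   # muddling along with humans in charge

ev = handoff_ev(
    p_aligned=0.9,             # credence that the AIs won't screw you over
    value_if_aligned=10.0,     # upside if that trust is well-founded
    value_if_betrayed=-100.0,  # downside if it isn't (e.g. takeover)
)

# ev = 0.9 * 10 + 0.1 * (-100) = -1.0, worse than the 1.0 baseline:
# even 90% credence in alignment doesn't justify this handoff,
# because the downside dominates.
print(round(ev, 6), ev > value_without_handoff)  # -1.0 False
```

Real versions of this calculus are messier (the outcomes aren’t binary, the probabilities aren’t knowable, and the stakes resist quantification), but it makes the qualitative point: the worse the downside, the more well-founded the trust needs to be before handoff beats the status quo.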
Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc., and yet still be rightly reluctant to put AIs in charge of everything. In other words, you may be confident that your AIs are trustworthy, yet still not trust them to decide everything. For example, your AIs might have enough deontological constraints on their actions (honesty, obedience, etc.) that they can be trusted not to take over the world, not to disempower you, etc. Yet at the same time, your AIs’ long-term goals might be subtly (or majorly!) different from your own, such that if you let them make the decisions, things will predictably go downhill from your perspective.

Analogy: You are a nonprofit board looking for a CEO to run your fast-growing organization. For some candidates, you might be worried that they are untrustworthy—for example, they might lie to you, pull various schemes to get their rivals on the board kicked off, and ultimately, one day you might try to fire them and find that you can’t. But suppose they are in fact trustworthy, would never do those things, and will always obey your orders. Still, they might have different values than you, different philosophies, different risk tolerances, etc., such that it would be a bad idea (from your perspective, not theirs) to hire them. They might take the nonprofit in a very different direction than you envisioned, for example, or make too many risky bets that blow up big time later. “Personnel is policy,” as the saying goes.

Why would you ever hand off trust? Why ever put a group of AIs in a position where they could take over the world if they wanted to? Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you. In other words, perhaps one plausible reason for handing off trust is that you want to hand off decisionmaking and you don’t have a way to do that without also handing off trust.

Another reason to hand off trust is to enforce contracts/agreements, in situations where the AIs are probably aligned. For example, the US and China might want to agree to respect each other’s sovereignty in perpetuity, yada yada, because otherwise they are trapped in a crazy robot-and-WMD-and-superpersuasion arms race. But you can’t trust humans to keep their word. You CAN trust AIs to keep theirs, at least if they’ve been suitably trained/designed.

For smaller-stakes handoffs, the calculus is similar. E.g. you might hand off decisionmaking over many aspects of hospital patients’ health to an AI system because you have evidence that the AI system is more competent than your doctors and nurses; you recognize that this also involves handing off trust to the AI system (if it decided to kill your patients, it could easily do so), but you trust that it won’t.

AIFP hot take: We generally expect most powerful actors in charge of AI programs to hand off trust too early (while the risks are still high and outweigh the benefits) and to hand off decisionmaking too late (e.g. harmfully keeping a ‘human in the loop’ long after the point when they mostly just get in the way & slow things down). We think there might be an awkward “worst of both worlds” period where superhuman AI systems have been given significant power and autonomy — such as de facto control of their own datacenters & license to self-improve — such that they could take over the world if they wanted to, and yet simultaneously the world is full of problems that could be solved much better and faster, and risks that could be reduced/averted, if only the AIs were put in charge of more things in the real world.

That said, we aren’t confident. For reasons mentioned previously (see the CEO analogy), such a period might make a lot of sense.