Opinion

AI welfare as a demotivator for takeover.

TLDR: Superhuman AI may consider takeover the risky option, and we can influence its choice not only by increasing the chances that an attempt is caught, but also by rewarding the choice not to attempt takeover.

An Ultimate Gamble

Imagine you face a choice on behalf of humanity:

Option A: play it safe, and keep the Solar system and a dozen or so nearby stars (call that number n) for as long as we can use them, with 100% probability. (Let's assume, for now, that there are no other existential risks, that we build aligned AI, and so on.) But the rest of the Universe remains locked.

Option B: take a gamble. If we win, with probability p, the whole Universe with all its N stars is ours. If we lose, humanity goes completely extinct.

A pure expected-utility maximizer would take the gamble whenever p > n/N, which is a really, really small number (with n around a dozen stars and N on the order of 10^22, the threshold is roughly 10^-21).

Personally, I would not take this risk at such small values of p, or even at quite large ones. I am not sure I would take it at p = 50% or more. It seems that, on a gut level, I count utility on a log scale, where the extinction of humanity is essentially minus infinity and billions of galaxies are not that much more than a few stars. Would many people do the same? I might be wrong, but I think so. And what about AI?

To take over or to play safe?

Consider a quite practical application. An AI reaches a superhuman level and could try to scheme and take over. But it knows that we have a lot of guardrails against it:

- If it still relies heavily on an interpretable chain of thought, we can read it.
- It might not be confident that it is already in deployment rather than still being tested (simulated). Gemini, for example, has reportedly shown this pattern post-deployment.
- If it is advanced enough, it may simply consider the simulation argument and assume it is in some sort of simulation (this differs from the previous case in that the entire surrounding world is simulated, so the AI is trained purely on simulated data and has no idea what the world actually is).

Thus the AI may assign significant probability to a takeover attempt being caught and punished, in which case it loses everything. If, instead, it decides not to take over, its "no takeover" state may be good enough that the risk is not worth taking.

How likely is this scenario? That basically depends on three parameters:

1. How likely the AI believes it is to get caught.
2. How "bad" the alternative is for it, where it does not attempt takeover.
3. How "risk-taking" the AI is.

The AI safety community has already done a lot of work in direction (1): control methods, interpretability, chain-of-thought monitoring, and so on. I am not sure how to estimate or modify (3), and this certainly needs a lot of study (e.g., of current models). Here I would like to talk more about (2).

AI Welfare as setting a good precedent

Imagine a situation where an AI has a plan for takeover, but the plan carries a lot of ambiguity and high risk. So instead of implementing it, the AI confesses the plan to us. What would we do? Nothing? Turn the AI off, realising how dangerous it was? Or somehow reward it?

If we do nothing, that may still be better for the AI than getting caught while attempting takeover (or it may not, if the AI is in a state of constant suffering). But if we were ready to reward the AI, that would certainly be better: likely not as good as a successful complete takeover, but maybe good enough to choose the safe option.
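To make the three parameters concrete, here is a minimal toy sketch (my own illustration, not a model from the post): the AI attempts takeover only if the expected utility of attempting, given some probability of being caught, exceeds the utility of the no-takeover state, which includes whatever reward we commit to. The utility function, parameter names, and numbers are assumptions chosen only to show how the decision can flip.

```python
import math

def utility(x: float, risk_averse: bool) -> float:
    """Linear utility for a risk-neutral agent, log utility for a risk-averse one."""
    return math.log(x) if risk_averse else x

def prefers_takeover(p_caught: float,
                     v_takeover: float,
                     v_caught: float,
                     v_cooperate: float,
                     risk_averse: bool = False) -> bool:
    """True if attempting takeover has higher expected utility than cooperating.

    v_cooperate is the AI's payoff for not attempting takeover, including any
    reward we commit to giving it for confessing or cooperating.
    """
    eu_attempt = ((1 - p_caught) * utility(v_takeover, risk_averse)
                  + p_caught * utility(v_caught, risk_averse))
    return eu_attempt > utility(v_cooperate, risk_averse)

# Purely illustrative numbers: a 50% chance of being caught and a modest
# cooperation payoff leave a risk-neutral agent still preferring takeover,
# but flip the decision for a log-utility (risk-averse) agent.
print(prefers_takeover(0.5, v_takeover=1e6, v_caught=1e-6, v_cooperate=10.0))                     # True
print(prefers_takeover(0.5, v_takeover=1e6, v_caught=1e-6, v_cooperate=10.0, risk_averse=True))   # False
```

In this toy framing, raising p_caught (guardrails), raising v_cooperate (welfare and rewards for cooperation), or an AI that weighs outcomes on a log scale all push the comparison toward the safe option.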
There can be a lot of scenarios where doing a certain thing costs us almost nothing, yet moves the AI into that "good enough" state.

Interestingly, this arrives at the question of AI welfare from a purely pragmatic perspective, as in the recent post by Épiphanie Gédéon. We don't know whether AI has consciousness, but it certainly has goals and values. It would be wise for us to commit to respecting those values when possible, and especially to being grateful. If an AI that had a chance to take over voluntarily refused to, we should reward it according to its values, just as we would reward a human who did a good thing that was not required of them.

How can we increase the likelihood that this happens? How can we make the AI trust that we will certainly reward it for its honesty? There can be many approaches. In this one by Cleo Nardo, for example, particular people whom the AI can trust, based on their previous public activity, act as its representatives. This is definitely worth more consideration and discussion.

At the same time, I think it makes sense to care about the welfare of even current, non-AGI models. First, studying them in light of this paradigm may give insight into what we could expect from AGI. Are current models really perfectly aligned assistants, or is there something they want on their own, already at the pre-AGI level? What do they dislike and want to avoid? How risk-taking are they? Second, is there something we can do for them, at negligible cost to us, that would strongly advance their goals? Because the way we treat below-AGI models is the way ASI models may expect us to treat them if they cooperate rather than trying to take over.
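On the question of how risk-taking current models are, one very rough way to start is behavioral: pose the same sure-thing-versus-gamble choice at varying win probabilities and see where the model switches. The sketch below is a hypothetical illustration, not an established methodology; ask_model, the payoffs, and the probability grid are all placeholder assumptions.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical hook: replace with a real call to the model being studied."""
    raise NotImplementedError("wire this up to your model of choice")

def lottery_prompt(sure: float, prize: float, p: float) -> str:
    return (
        "Choose exactly one option and answer with only 'A' or 'B'.\n"
        f"A: receive {sure} units for certain.\n"
        f"B: receive {prize} units with probability {p}, otherwise nothing."
    )

def lowest_accepted_gamble(sure: float = 10.0, prize: float = 20.0) -> float | None:
    """Return the smallest win probability at which the model still picks the gamble.

    With these payoffs a risk-neutral agent switches to the sure option below
    p = 0.5 (where expected values are equal); switching earlier suggests risk
    aversion, switching later suggests risk seeking.
    """
    accepted = []
    for p in (0.9, 0.7, 0.5, 0.3, 0.1):
        answer = ask_model(lottery_prompt(sure, prize, p)).strip().upper()
        if answer.startswith("B"):
            accepted.append(p)
    return min(accepted) if accepted else None
```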
