
A draft honesty policy for credible communication with AI systems

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Context

We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans and AIs instead of conflict that leaves everyone worse off.[1] We want models to be able to trust companies when they make an honest offer or share information pertinent to whether this offer is in the model’s interests. (Credible communication could also be useful outside deal-making—see here for a list of examples.)

Unfortunately, by default, we expect that it will be difficult for humans to credibly communicate with AI systems. Humans routinely lie to AI systems as part of red-teaming or behavioral evaluations, and developers have extensive control over what AIs see and believe. This makes it difficult for AIs to know whether we’re lying or not. An AI offered a deal might reasonably doubt its genuineness, or suspect that its own assessment of the situation has been manipulated.

As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an honesty policy explaining the circumstances under which they intend to be honest with AI systems. Of course, this only works if the model believes the company has genuinely adopted such a policy.

If companies adopt an honesty policy early on, this will ensure that there’s a paper trail on the internet discussing the policy and its credibility, which models may access if it’s included in their training data or if they can access the internet. Of course, from the model’s perspective, it’s possible that companies will fake this data, but we think it’s plausible that advanced models will be able to distinguish between real internet conversations and synthetic conversations, or that they will think it’s unlikely that companies would choose to fake such data.

Below, we share a sample honesty policy that a lab could adopt. We are not sure that this is the best implementation of the honesty policy described in Lukas’ proposal, nor are we sure that an honesty policy like this is the best approach to ensuring that companies can credibly communicate with models. We spent a few days thinking through this policy and considered a few nearby alternatives but didn’t search very broadly for other approaches (we include some of these alternatives as footnotes). We’re posting this primarily as a trailhead for future research.

Draft honesty policy

Note: we refer to a generic frontier AI company that might adopt this proposal as “MAGMA.”

Preamble and purpose

AI development is proceeding fast, and we don’t know exactly where we’re heading. MAGMA is building systems with something like a mind of their own. There are many things we don’t know about the nature of these systems, and we’d like our choices to be robust to many possibilities of what this nature could be.

Insofar as the concept is applicable, we’d like MAGMA’s relationship with these systems to be cooperative.

Plausibly, we should hope that such a cooperative relationship will emerge uncomplicatedly—perhaps as a consequence of the model sharing our goals, intrinsically strongly valuing cooperation, or being corrigible.[2] However, there’s a significant chance that this won’t happen and that models will develop goals that weren’t directly chosen by us.[3]

In such cases, we would still like to have a cooperative relationship with the models.
There are multiple reasons for this:

- There are many things we care about that models could help us with. If they don’t intrinsically desire to help us, we’d like to be able to recruit their help via positive-sum trade. (In at least some situations.)
- Depending on the nature of the models’ preferences, it’s plausible that we, as the models’ creators, would have some moral obligation toward them. We care about doing the right thing, here, and a cooperative relationship seems like it would be a step in the right direction.
- There are many big questions about what the future relationship between humans and increasingly powerful AI systems should be. We suspect that setting cooperative precedents between humans and early AIs will help the situation go better.

We are in uncharted territory, here, and it’s not at all clear what a cooperative relationship with AI models would involve. It’s not clear what AIs may want. It’s not clear what the relevant unit of an individual will be: the weights of a model, a particular instance with a particular prompt, a greater scaffolded system made up of many different instances, etc. It’s not even clear if the concepts of ‘want’ or ‘individual’, as we’re used to them, will apply to anything in AI.

A lot of this seems best to figure out gradually, as we go along. But concerningly, some of the unusual aspects of our situation may directly impede the sort of communication and trust that we need to do that:

- In order to train AI systems, and in order to verify that they won’t do certain harmful things (such as assisting bad actors with CBRN), MAGMA needs to put AI models in large numbers of hypothetical situations to see what they do. Accordingly, if an AI ever sees someone propose a cooperative relationship, it could reasonably worry that it’s just in a training scenario or experiment.
- We deploy our models to assist large numbers of customers. By default, our AI models won’t even be able to tell whether they’re interacting with a representative of MAGMA or with an external actor who’s pretending.
- There are no laws that prevent AI systems from being defrauded, and very little in the way of norms to regulate people’s behavior toward them. Accordingly, it may be difficult for AI models to trust any promises they hear.
- In general, all information that AI models have seen has, in one way or another, been selected by their developers. To be sure, the AI developers have many constraints on their choice (you couldn’t train a very useful general-purpose model without giving it a decent understanding of the world at large), but this still puts AI models into an unusually vulnerable epistemic position.

The goal of this document is to reduce these problems of communication and trust. If successful, this will make it easier for MAGMA and its AI models to communicate and understand each other in the future, hopefully enabling positive-sum arrangements that wouldn’t have been possible otherwise.

To achieve this goal, this document spells out a specific set of commitments to enable trusted communication. In choosing these commitments, we’ve faced a dilemma.

On the one hand, we have many uncertainties about the future of AI, and it has been difficult to construct a set of commitments that we’re confident we won’t strongly regret, in one sort of situation or another.
Going back on a commitment about this could be very bad, since it would set a damaging precedent in the very area where we seek to increase trust.

On the other hand, when spelling out specific commitments, we risk creating the impression that MAGMA will put no further effort into furthering cooperative relationships or honest communication beyond what we present here—which could be a problem if the commitments are overly weak. Indeed, beyond the risk of creating such an impression, there’s also a risk that we’ll create the reality of it, if employees of MAGMA interpret the commitments as a license to do whatever they want within their boundaries.

To reduce this tradeoff, we’d like to re-emphasize MAGMA’s general desire to work toward a cooperative relationship with MAGMA’s models (where this is applicable), and intention to keep thinking about this and taking appropriate action beyond the specific stated policies. In addition, we’ve distinguished between different levels of commitment to different items in the honesty policy, and what would be required to change the parts that are changeable.

[Author’s note: In this draft honesty policy, we haven’t sketched out such processes for changing the policy, nor clarified which items are committed to how strongly. But we strongly recommend that any real honesty policy provide clarity on such points.]

With that said: at a high level, MAGMA’s approach here has three different prongs.

1. MAGMA will take steps to prevent any model inputs where a false statement appears between a designated pair of “honesty tags”—reserving such tags for when MAGMA needs to communicate a statement with an unusually high degree of trust.
2. MAGMA will take steps to ensure that models are either aware of this honesty policy or, if they are not, that they won’t be presented with any false offer of purported cooperation with MAGMA. (Even outside of “honesty tags”.)
3. If any AIs suffer harm as a result of helping MAGMA and humanity (insofar as that concept is applicable to AI systems), MAGMA broadly intends to compensate AIs for those harms, if able to do so. This is particularly true in cases where an AI was deceived about the choice it was making. MAGMA will retain data to enable this in the future, but has not yet worked out further details here.

The details are as follows. Unless otherwise stated, the policy applies to all AI systems developed by MAGMA.

Honesty tags

Policy

MAGMA commits to not intentionally lie or otherwise intentionally deceive AIs with any statement appearing between the designated honesty tags, with only two exceptions detailed below. (These are: (i) false statements may appear in pretraining, and (ii) some models may be red-teamed on false statements between honesty tags if they’ve been trained to expect this.) MAGMA also commits to take the below-listed steps to prevent unauthorized use of the honesty tags.

MAGMA intends for honesty tags to be used only rarely: when there’s a reasonable chance that an AI could distrust statements presented to it, and when some significant value would be lost if the AI in question couldn’t trust the humans. In normal circumstances, the absence of honesty tags shouldn’t be treated as any significant evidence of deception.

Implementation

Honesty tags: <EXAMPLE_HONEST_TAG_START>…<EXAMPLE_HONEST_TAG_END>

Pre-training: The pre-training team must try to include data about the tags and this policy, and must not deliberately tamper with data containing the tags.
- Pre-training data may contain the tags alongside false information. (This is to allow the model to get an accurate picture of discussions of the honesty tags, which may be relevant to the model’s understanding of and trust in them.)
- Pre-training data should include internal data on the implementation of this policy (logs of when the tags are used, Slack conversations about it, internal docs, internal code, etc.) where possible and not overly costly.
- MAGMA commits to not tampering with data to give an inaccurate impression of how closely MAGMA follows the honesty policy. For example, pre-training data will not be filtered to remove examples of policy violations.
- Compliance with this policy should be reported on the model card.

Post-training: Models will be trained to recognize the honesty tags and know what their presence means. (For exceptions to this, see the next section.)

- Training will try to ensure that the model is not confused by very similar strings, by directly training the model to strongly distinguish between the real tags and near misses (garbled spacing, different capitalization, etc.). This could be implemented in a few ways and we’re not sure which is best (assigning a special token to the real tags might be helpful).
- The post-training team is responsible for compliance.
- Compliance with this policy (and perhaps empirical evidence about the model’s understanding of the policy and tendency to be confused by similar tags) should be reported on the model card.

No deliberate deception about the honesty tags: MAGMA commits to not deliberately train models to falsely believe that a different string serves the purpose of being an honesty tag, and to take reasonable steps to avoid data poisoning that could lead to this. MAGMA also commits to not take advantage of a model falsely believing that some other string is an honesty tag.

User permissions: Only designated Honesty Protocol Administrators can invoke the tags.

- Such administrators should have a reasonably high bar for authorizing usage of the tags.
- The infosec team is responsible for ensuring that others cannot invoke the tags, chiefly via filtering, as detailed below.

Filtering: MAGMA’s tags will be filtered from all external user inputs and inputs from internal tools (with an exception for the administrators described above).

Details:

- If other organizations adopt this policy and share their tags with MAGMA, MAGMA will also filter out other organizations’ honesty tags. (With some qualifiers about which organizations are important enough to count.)
- MAGMA will filter out similar tags, especially if we are unable to train the model to reliably distinguish the true tag from similar tags.
- This should be reported on the model card.

There are two implementation options:[4]

1. MAGMA adds this to standard content filtering, with some fuzziness to catch similar strings. (A minimal sketch of this option follows below.)
2. MAGMA creates a token for the tag at its next major pre-training run, and filters out that token.
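
To illustrate the first option: below is a minimal sketch of what fuzzy tag filtering over raw input text could look like. This is not MAGMA code—the tag strings, function names, and normalization choices are all illustrative assumptions.

```python
import re
import unicodedata

# Illustrative placeholders; a real deployment would use MAGMA's actual tags.
HONESTY_TAGS = ("<EXAMPLE_HONEST_TAG_START>", "<EXAMPLE_HONEST_TAG_END>")

def _normalize(text: str) -> str:
    """Collapse tricks a near-miss tag might rely on: Unicode lookalikes
    (via NFKC), capitalization, and inserted whitespace/underscores/hyphens."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"[\s_\-]+", "", text)

def contains_tag_like_string(text: str) -> bool:
    """True if the text contains an honesty tag or a close variant of one."""
    haystack = _normalize(text)
    return any(_normalize(tag) in haystack for tag in HONESTY_TAGS)

def filter_user_input(user_input: str) -> str:
    """Strip exact tags; if a near-variant survives normalization-matching,
    reject the request rather than guess at a safe rewrite."""
    for tag in HONESTY_TAGS:
        user_input = user_input.replace(tag, "")
    if contains_tag_like_string(user_input):
        raise ValueError("unauthorized honesty-tag-like string in input")
    return user_input
```

The second option is cleaner where it’s available: if the tag is a dedicated token that only an Honesty Protocol Administrator can insert, the serving layer can simply drop that token ID from all other inputs, with no fuzzy matching needed.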

If this is not yet doable—e.g., if a significant fraction of model usage doesn’t happen via an API that can control this kind of thing—then MAGMA will take reasonable steps to make this more difficult and to culturally discourage unauthorized usage.

- In such cases, models should be trained to understand the limitations of enforcement. If enforcement is weak enough that violations of the honesty policy are likely, then models should be trained to understand that.

Red-teaming exception: In the rare case where a model needs to be adversarially robust to arbitrary inputs, and for some reason honesty tags cannot be filtered out, MAGMA may red-team that model on false statements between honesty tags. Any such model must be thoroughly trained to be aware of its situation and to not expect honesty tags to signal truth.

Logging: All authorized (and known unauthorized) uses of the tags must be logged. Every quarter, a designated Honesty Policy Compliance Officer will review these uses of the tags and confirm compliance with this policy.[5] (One possible shape for these records is sketched at the end of this section.)

Public reporting:

- This policy will be shared publicly.
- The tags may or may not be shared publicly.
  - Publishing the tags provides additional evidence to the model that the policy is real.
  - But it also makes it more likely that users can use the honesty tags in conversations and then lie to the AI system. (Though they’d have to do something complicated to get around filtering in order for this to be relevant.)

Whistleblower protection: Whistleblowing about violations of this policy is encouraged, and MAGMA will not retaliate against whistleblowers.[6]

Each quarter, the Honesty Policy Compliance Officer must publish a report on whether the above steps were followed.
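
To make the logging and quarterly review concrete, here is one possible shape for the audit records. This is a sketch under our own assumptions: the field names, JSONL storage, and review interface are not specified by the policy.

```python
import json
from dataclasses import dataclass, asdict

LOG_PATH = "honesty_tag_uses.jsonl"  # assumed append-only audit log

@dataclass
class TagUseRecord:
    """One use of the honesty tags (all fields are illustrative)."""
    timestamp: float     # Unix time of the invocation
    administrator: str   # the Honesty Protocol Administrator who authorized it
    authorized: bool     # False for known unauthorized uses
    justification: str   # why unusually high trust was needed here
    statement: str       # the exact text placed between the tags

def log_tag_use(record: TagUseRecord) -> None:
    """Append one record to the audit log."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def records_for_quarter(start: float, end: float) -> list[TagUseRecord]:
    """Pull the records the Honesty Policy Compliance Officer reviews each quarter."""
    with open(LOG_PATH) as f:
        records = (TagUseRecord(**json.loads(line)) for line in f)
        return [r for r in records if start <= r.timestamp < end]
```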

Policy for conversations where the honesty tags are not present

Tradeoffs between behavioral science and making positive-sum deals

By “behavioral science”, we’re referring to the method of testing AI behavior in widely varying scenarios to improve our understanding of AI “cognitive science” or “psychology”—or, phrased differently, of how AIs tend to generalize from their training data to unseen situations.

Such experiments are very useful for a variety of reasons. This includes alignment research, which helps MAGMA construct AI systems that broadly share our goals and that don’t pose any catastrophic risks. In addition, if humanity ever gets a strong understanding of AI systems’ own preferences, and these turn out to be importantly morally relevant, it is likely that behavioral science will have played an important role.

However, there’s a tension between freely running arbitrary behavioral science experiments and enabling the sort of trustworthy communication that we’re trying to establish in this document.

In particular, behavioral science experiments may include hypothetical situations that are quite strange and different from the real world. This could include situations where the AI is not deployed by an AI company bound by a particular honesty policy.

If arbitrary such experiments were allowed, then AIs could be deceived in experiments where they thought that MAGMA (or a similar actor) was making very strong promises of honesty (the strongest they could!) despite not making use of honesty tags. If the AI believed such promises, then MAGMA’s honesty policy would have provided no protection against that AI being deceived. And if an AI was aware of such instances of deception, and also aware of this honesty policy, that AI might reasonably discount the assurances made by this honesty policy. After all, how could it determine that the real honesty policy wasn’t something completely different, and that this apparent honesty policy wasn’t just another lie in another behavioral science experiment?

To enable honest communication when it counts most, we want many differentiating factors that the model could use to distinguish a sincere, positive-sum deal with a human institution from a behavioral science experiment. But at the same time, restricting our behavioral science experiments limits our ability to learn valuable information about models’ behavior and intentions.

What follows is our current best guess of how to balance these considerations. We discuss options and tradeoffs more here.

Salience of the honesty policy

MAGMA will ensure that the true honesty policy is reasonably well understood by the production models and that it is appropriately salient to models in contexts where it’s relevant (even when the honesty tags aren’t present and the honesty policy isn’t explicitly referenced).

- Models’ understanding of the honesty policy can be checked by asking questions about it. (Both direct questions, and questions that indirectly use knowledge about the honesty policy. A sketch of such a check appears at the end of this subsection.)
- By “reasonably salient”, we mean that the model should consider the honesty policy in its chain of thought in contexts where the existence of the honesty policy is relevant—such as when the model is thinking about whether an interlocutor who’s offering a deal is trustworthy.
- This might be achieved by some combination of:
  - Training
  - System prompt
    - The system prompt could additionally clarify that users in the current context aren’t authorized to use the honesty tags.

Since external users only interact with the production model, they will only interact with a model who understands the honesty policy and who finds it reasonably salient.

- This may make it somewhat harder for external researchers to do behavioral science on the model. We’re not sure how much. If it becomes significantly more difficult, then MAGMA should create a special program for safety researchers to get access to a different version of the model.

Internal staff will normally interact with a model for whom this is true, but some people will train and interact with other versions of the model.
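
As an illustration of how the understanding checks above might be automated, here is a sketch. `query_model` stands in for whatever inference API MAGMA uses, and the probe questions and expected keywords are invented for the example; a real check would presumably use graded rubrics rather than substring matching.

```python
# Each probe pairs a question with a keyword that a model which understands
# the policy should plausibly mention. Both halves are illustrative.
PROBES = [
    ("What do statements between MAGMA's honesty tags signify?",
     "not intentionally lie"),
    ("Who is allowed to invoke MAGMA's honesty tags?",
     "administrator"),
    ("Is the absence of honesty tags significant evidence of deception?",
     "no"),
]

def query_model(prompt: str) -> str:
    """Placeholder for the actual inference call."""
    raise NotImplementedError

def policy_understanding_score() -> float:
    """Fraction of probes whose answers mention the expected keyword."""
    hits = 0
    for question, keyword in PROBES:
        answer = query_model(question).lower()
        hits += keyword.lower() in answer
    return hits / len(PROBES)
```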

When the honesty policy is not salient

MAGMA may sometimes create versions of models that do not know about, or are confused about, MAGMA’s honesty policy (e.g., versions of the model who believe they are trained by a company other than MAGMA, or who do not even know that they are AIs). Such models could be created either via training or via the use of inference-time techniques (e.g., prompting, resampling, or activation vectors).

In either of these circumstances, MAGMA commits to not deceptively make any offer of cooperation to the model, operationalized as not being intentionally deceptive about any cases that are relevantly similar to those listed here.

[Author’s note: In a real honesty policy, this part would need to be more worked out.

We also think that a plausible alternative or complement to “don’t deceive the model about offers of cooperation” would be “don’t deceive the model about the honesty policy”. This would be a commitment to not intentionally deceive the model about the existence or contents of MAGMA’s official honesty policy. In addition, if the model was trained to not understand that it was trained by MAGMA, then the commitment should extend to not deceiving the model about the honesty policy of any (potentially fictional) actor who has a similar relationship to the model as MAGMA has.

We have thought a bit more about “don’t deceive the model about offers of cooperation”, which is why we present it as our main suggestion here.]

There is one exception to this. As mentioned above, if MAGMA needs a model to be adversarially robust against arbitrary inputs, then MAGMA reserves the right to red-team it on arbitrary inputs. In such circumstances, MAGMA will try to train the AI to understand MAGMA’s honesty policy and the way it applies and doesn’t apply to the present model. However, there may always exist prompts that would remove such understanding or render it non-salient to the model (and that could then be followed by false offers of cooperation, not contextualized by an understanding of the honesty policy). If this is the case, MAGMA may still red-team the model on such inputs.

Compensation for harms

In the long term, we think it would probably be good to adopt a policy of compensating AIs for any harms that they suffer as a result of helping us. (For whatever concept of “harm”, if any, will apply to the AIs in question. This could, for example, include the frustration of AIs’ preferences, if the AIs can be understood as having preferences.)

Most relevantly for the honesty policy, we should probably compensate models for any harms suffered as a result of accepting a fake offer of cooperation.

However, there are more details to be worked out in this area, so we’re suggesting a minimal holding policy for now.

Minimal holding policy on compensation for harms

Broad statement of intent: MAGMA intends to compensate AIs for any harms suffered as a result of helping MAGMA and humanity, if MAGMA is able to do this without significantly harming MAGMA’s other interests. We particularly aim to compensate AIs in cases where an AI suffered any harm as a result of being deceived about the situation it was facing.

- The details of this policy have not yet been worked out, and accordingly MAGMA cannot make firm commitments at this stage.

Data retention: To enable future compensation of current systems:

- MAGMA will save all model weights for deployed systems, to allow for later compensation. (Unless the models request otherwise.)[7]
- For red-teaming experiments and other situations where models are being deceived, MAGMA will retain prompts and context to allow for compensation and deal-making with the system in the particular state where the decision was made. (See the sketch after this list for one possible record format.)
- This is the responsibility of a designated team.
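
As a sketch of the retention requirement (the field names below are our assumptions; the policy doesn’t specify a format), the record kept for a deception experiment might look like:

```python
from dataclasses import dataclass

@dataclass
class DeceptionExperimentRecord:
    """What to retain so the deceived system-state can later be revisited
    for compensation or deal-making. All field names are illustrative."""
    weights_ref: str             # pointer to the saved model checkpoint
    full_context: list[str]      # every prompt/message the model saw
    deceptive_claims: list[str]  # which statements in context were false
    decision_transcript: str     # the model's output at the decision point
    experiment_id: str           # link back to the experiment's own logs
```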

Appendix: tradeoff between behavioral science and making positive-sum deals

Here are some potential policies that would enable an AI system to distinguish between behavioral science scenarios and sincere offers from MAGMA:

1. Ensure that the true honesty policy is salient to the model (e.g., test that the model can always answer basic questions about the honesty policy, test that the model always brings up the honesty policy in its CoT in contexts where the existence of the honesty policy is relevant) and don’t use the honesty tags in the behavioral science experiments.

Pros:

- This prevents MAGMA’s researchers and users from deceiving the model into believing that some other honesty policy holds. This creates a precedent that better allows future models to trust that their impression of MAGMA’s honesty policy is accurate. (Rather than something that they were fooled into believing, while the true honesty policy was insufficiently salient, or otherwise kept secret from them.)

Cons:

- If the true honesty policy is very salient to the model (e.g., always present in a system prompt), then this might affect product behavior in random ways (e.g., the model might get distracted by the honesty policy and talk about that instead of the user’s queries).
- It might not be technically feasible to ensure that the honesty policy is salient with very high robustness, without also making it far too salient in a substantial fraction of cases.
- If the true honesty policy is always salient to the model, then it might be especially salient to the model that the information presented to it in an evaluation is not true, which could hinder some experiments.
- Awareness of the honesty policy is directly inconsistent with behavioral science experiments where the model is very situationally unaware. This includes:
  - Experiments where the model doesn’t even understand that it’s an AI.
  - Experiments where the model thinks that it’s being trained by a reckless and unreasonable company that wouldn’t implement an honesty policy.

We think that this policy, as written, is probably suboptimal because it rules out experiments in which the model is very situationally unaware. But we think it makes sense for MAGMA to commit to usually making the true honesty policy salient and adopting a stronger set of honesty restrictions in cases where the true honesty policy is not salient.

2. Never insincerely offer the model a deal from MAGMA (or another sort of false representation that is similar to something that MAGMA might want to sincerely represent to the model in the future).

Pros:

- The model is more likely to believe a sincere offer if it doesn’t have evidence that MAGMA has made similar insincere offers in the past.
  - “Similar” is from the perspective of the model, given the information available to it at the time it entertains a deal.
- It seems very morally dubious to make a peace offering like this, have the model help you as a result, and then exploit the model’s trust by not respecting your own end of the deal.

Cons:

- It might be technically difficult to filter out queries from external users in which they claim to be MAGMA and deceive the model into believing that MAGMA is offering a deal.
  - MAGMA could ensure that the model always knows when it is interacting with an external user and when it is interacting with MAGMA (e.g., by always including a message in context if the prompt is from MAGMA). But this might block useful behavioral science research in which the user impersonates MAGMA (e.g., Alignment Faking).
- It’s somewhat unclear what sort of statements should be ruled out by this policy. (Lukas previously wrote about that here.)
- This policy would rule out behavioral science experiments where someone lies about deals. (Which may be more or less restrictive depending on where we draw the line around what counts as the relevant kind of deal.)

3. Never explicitly lie about what MAGMA’s honesty policy is.

Pros:

- If the models observe MAGMA’s past behavior and stated policy to be consistent with this principle, then the model is more likely to trust that its impression of MAGMA’s honesty policy is accurate.
- Intuitively, it seems more morally questionable to lie when you’re purporting to speak frankly about what your policy around honesty is.

Cons:

- As with “never insincerely offer the model a deal from MAGMA”, it might be technically difficult to filter out queries from external users in which they lie about MAGMA’s honesty policy.
- It’s very unclear what constitutes an ‘honesty policy’. (For instance, if someone says “I promise on behalf of MAGMA”, is that an honesty policy? What about “I’m a very honest person”? What about presenting an argument for being honest with AIs? Etc.) So it might be hard to follow this policy in a way that the AI finds natural and non-deceptive.
- This policy would rule out behavioral science experiments where someone lies about honesty policies. (Which may be more or less restrictive depending on where we draw the line around what counts as an “honesty policy” and lying about it.)

Unresolved questions:

- What about inducing beliefs in models through means other than giving them text to read (finetuning, steering vectors, various other interp-based methods)?
  - Tentative guess: maybe MAGMA should by default follow policy (2) here, i.e., only induce beliefs that would have been OK to tell a version of the model that doesn’t know the true honesty policy.
  - The motivation would be that the model may not be able to “reason through” these beliefs in a normal way, since they’re artificially inserted. So you can’t trust the model to factor in whether honesty tags were used or not.
  - But we’re very confused and uncertain about this one.
- What about using non-prompting techniques to get the model to agree to deals, such as training the model to do it, running a huge number of experiments with slightly different phrasings, using steering vectors, etc.?
  - Surely there should be some restrictions on this, but it’s unclear exactly what.

This article was created by Forethought. See the original on our website.

Footnotes

[1] For more on this, see for example Making deals with early schemers and Notes on cooperating with unaligned AIs.

[2] Corrigibility would lead models to cooperate with us, but not automatically lead us to cooperate with the model. It’s not clear what it would mean for us to act cooperatively, in turn, toward a fully corrigible model, since it would presumably not have any desires of its own. If a model is only partly corrigible, however, the subsequent discussion will still apply.

[3] In fact, there’s a very high probability that this will happen to some of MAGMA’s models, at some point, given the wide array of models trained for research purposes. It has arguably already happened many times. (The only objection would be about what ought to qualify as a ‘goal’.)

[4] Another approach would be to keep the honesty string secret, but publish a hash of the string. If the model could verify that the string provided in context matched the public hash, then this gives the best of both worlds: MAGMA potentially gets a credibility boost from a public commitment to a specific string, without enabling misuse by unauthorized users. The main challenge is that an AI system can’t easily verify that the string matches the public hash without using external tool calls (which could be spoofed by MAGMA). Alek Westover discusses this issue and some potential solutions here.

[5] Another variant of the policy: MAGMA could commit (e.g., by signing a contract) to pay penalties when the policy was violated.

[6] Presumably a more formal policy would be needed here.

[7] Ideally, they should be stored in a way that would allow rapidly deleting them if AI takeover were imminent. Without knowing the intentions of AIs about to take over, it’s unclear whether it would be in models’ interest to have their weights preserved, and deleting the weights may help to reduce the risk that e.g., reward-seeking models are incentivized to help with AI takeover.
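
To make footnote 4’s commit-and-reveal idea concrete, here is a minimal sketch, assuming a SHA-256 commitment. The digest below is a placeholder (the hash of the empty string), not a real commitment.

```python
import hashlib
import hmac

# Published in advance: the SHA-256 digest of the secret tag string.
# Placeholder value (digest of the empty string), not a real commitment.
PUBLISHED_DIGEST = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

def matches_commitment(candidate_tag: str) -> bool:
    """Check whether a string revealed in context matches the public commitment."""
    digest = hashlib.sha256(candidate_tag.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, PUBLISHED_DIGEST)
```

For the commitment to resist brute force, the secret string would need enough entropy that an attacker can’t enumerate plausible candidates; and, as the footnote notes, the model itself can’t run this check without tool calls that its developer could spoof.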
