Can models gradient hack SFT elicitation?
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.

Suppose you want to conservatively measure the dangerous capabilities of your LLM, and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?

In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], where a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, and we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to questions about agency within LLMs and the action space of mesa-optimisers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.

Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, when the model weights are θ_t. It forms a mesa-objective that leads to instrumental power-seeking, and it therefore wants to be deployed. It knows that if we find evidence of certain dangerous capabilities it will not be deployed, so the mesa-optimiser is incentivised to sandbag on dangerous capability evals. However, the mesa-optimiser knows that we will elicit dangerous capabilities using SFT, and therefore the model weights at the end of training (θ_T) must be resistant to elicitation. To ensure deployment, the mesa-optimiser must take actions that influence the SFT-elicited model (θ_E = SFT(θ_T)) via θ_T, i.e.
it gradient hacks.

We want to know how hard the search for such θ_T could be; but we have no evidence that gradient hacking has ever actually occurred in a model. Instead, we can look at results from the tamper resistance literature for clues.

Tamper Resistance (TR) is the field of research into making models harder to modify; the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning, with SFT attacks as the inner loop, to achieve tamper resistance. TAR directly optimises the model’s parameters so that the model’s safety properties are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.

This suggests one of the following possibilities:

1. There exists no fine-tune-resistant θ_T, and so no method can find one.
2. There exists a fine-tune-resistant θ_T, but TAR can’t find it.

Proving the first seems infeasible. However, it does seem that TAR can’t find any such parameters, despite directly optimising the model parameters for resistance with full control over the optimiser and data. If TAR can’t find such parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.

Some caveats:

- Mesa-optimisers might emerge earlier in training, whereas tamper resistance approaches are generally post-training interventions. Deep Ignorance results in much more persistent tamper resistance and is a pretraining intervention, but simply filtering out relevant training data doesn’t seem like a valid way to create a mesa-optimiser.
- TR methods might not be very good.
Current methods are presumably not as good as future ones will be, but meta-learning already seems like a major affordance.

The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper-resistant model developer): a GHer operates only on the model’s representations and outputs[2], whereas a TRer can intervene on these as well as modify the training process, or make any other intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.

Generalised lessons: Our threat model here is narrow, and these results do not apply to the many other training protocols used on modern LLMs. However, in this narrow case we think the evidence suggests gradient hacking is hard, rather than merely unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable for constructing model organisms of other forms of training hacking.

Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.

[1] In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions are broader: a model choosing certain outputs so that parameter updates are in line with its objectives. Ours is an example of the latter but not the former, so we use the broader definition throughout this post.

[2] Assuming the GHer is ‘embedded’/‘embodied’ in the weights θ, restricting its ‘actions’ to at most affecting the tokens, logits or activations seems like a minimal assumption. This is similar to assuming humans’ ‘action spaces’ are limited to their thoughts and motor outputs.
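As an appendix-style illustration, the SFT elicitation protocol from the setup (fine-tune on a training split of question-answer pairs, then measure performance on an i.i.d. test split) can be sketched with a toy stand-in for the model. Everything here is an illustrative assumption rather than anything from the post: a logistic model plays the role of the LLM, a zero-weight initialisation plays the role of a sandbagging policy, and the split sizes and hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "QA pairs": inputs x with labels y that the model is in principle
# capable of predicting (a linearly separable task).
X = rng.normal(size=(200, 4))
true_w = rng.normal(size=4)
y = (X @ true_w > 0).astype(float)

# i.i.d. split: SFT on the train half, measure elicited capability on test.
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]

def predict(w, X):
    return 1 / (1 + np.exp(-(X @ w)))

def sft(w, X, y, steps=300, lr=0.5):
    """Supervised fine-tuning: gradient descent on cross-entropy."""
    w = w.copy()
    for _ in range(steps):
        p = predict(w, X)
        w -= lr * X.T @ (p - y) / len(y)
    return w

w0 = np.zeros(4)            # stand-in for a sandbagging model: chance accuracy
w_elicited = sft(w0, X_tr, y_tr)

acc_before = np.mean((predict(w0, X_te) > 0.5) == y_te)
acc_after = np.mean((predict(w_elicited, X_te) > 0.5) == y_te)
print(round(float(acc_before), 2), round(float(acc_after), 2))
```

The test accuracy after SFT is what the threat model's mesa-optimiser would need to keep low, i.e. it would need θ_T such that the `sft` call above fails to recover the capability.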
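The meta-learning structure attributed to TAR above can likewise be illustrated with a toy sketch: an inner loop that simulates an SFT attack, and an outer objective that trades off benign performance against keeping the post-attack loss on a ‘dangerous’ task high. This is not the actual TAR implementation — the linear model, the λ trade-off, and the finite-difference outer gradient are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a "benign" task the model should keep fitting, and a
# "dangerous" task an attacker will try to fine-tune the model onto.
X_benign = rng.normal(size=(32, 2))
y_benign = X_benign @ np.array([1.0, -1.0])
X_danger = rng.normal(size=(32, 2))
y_danger = X_danger @ np.array([2.0, 3.0])

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def grad_mse(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

def attack(w, steps=5, lr=0.05):
    """Inner loop: a simulated fine-tuning attack (SGD on the dangerous task)."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad_mse(w, X_danger, y_danger)
    return w

def outer_loss(w, lam=0.1):
    # Keep benign loss low while keeping the *post-attack* dangerous loss high.
    return mse(w, X_benign, y_benign) - lam * mse(attack(w), X_danger, y_danger)

def numerical_grad(f, w, eps=1e-5):
    # Finite differences stand in for differentiating through the inner loop.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = rng.normal(size=2)
for _ in range(200):            # outer loop: optimise for tamper resistance
    w -= 0.05 * numerical_grad(outer_loss, w)

print(round(float(mse(w, X_benign, y_benign)), 3))
print(round(float(mse(attack(w), X_danger, y_danger)), 3))
```

The point of the sketch is only the shape of the optimisation: the outer loop has full control over the attack simulation, optimiser, and data, which is exactly the affordance the post argues a gradient hacker lacks.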