
Alignment Fine-Tuning: Lessons from Operant Conditioning

Published on December 17, 2025 4:57 PM GMT

tl;dr: I am a neuroscientist with 20 years of experience using operant conditioning to train rats and mice. At the suggestion of Alex Turner (@turntrout), I reviewed the basic theoretical principles of operant conditioning in animal behavior in relation to (what I could glean about) current practices in alignment fine-tuning, and compiled a list of potentially underleveraged insights. Incorporating Alex's feedback on which seemed most useful, I have elaborated on the top six ideas here, listed in descending order of enthusiasm. I am indebted to Alex for feedback on drafts, but I am responsible for all views expressed.

This primer on operant conditioning might be useful background.

1. Train when young

Animal training insight

Operant conditioning is most effective in very young animals, at least for mammals. In rats, we find that just-weaned animals can pick up arbitrary perceptual discrimination tasks reliably and quickly, whereas adults do so unreliably and slowly. For pet dogs, "puppy kindergarten" is a class in which very young puppies learn basic pro-social behavior and obedience, which greatly improves the outcomes of later training on more complex tasks. Comparable training of adult dogs is more difficult, and the learned behaviors are more brittle.

Puppy kindergarten: animals learn pro-sociality and corrigibility best when introduced very early in development

This may be in part because it's easier to train neural circuits that are still relatively naive. But my speculation is that it's more because mammals are programmed to have high neural plasticity during the period when they depend on their parents. This would allow for more reliable inheritance of learned behaviors. To elaborate: innate behaviors have been advantageous for long enough for hard-coded neural wiring to have been selected evolutionarily. Those behaviors are very sticky (hard to override by experience).
For animals that are born reliant on parental care, parents can teach offspring additional learned behavioral contingencies that have been advantageous in their local environment over recent generations (by virtue of the fact that those with other behavioral contingencies did not become parents). When adult offspring become independent, learning allows them to continue to acquire new behavioral contingencies from experience, so that they can adapt to novel stimuli or changed contingencies. But brain plasticity declines as offspring become more independent – perhaps to make the lessons learned from parents stickier and harder to overwrite with their own experience, such that progeny can benefit from the information gained by their ancestors' survival of rare, high-stakes events.

Possible application to alignment fine-tuning

I suspect it would be maximally effective to begin reinforcement learning from human feedback (RLHF) extremely early in the training of LLMs, as soon as they can exhibit any identifiably good or bad behavior, and then periodically over the entire course of training. It might be far easier to steer or train models early in training because less information has already been embedded in their weights. Value alignment that is reinforced early in training might be more "deeply embedded" in how all other information is acquired, perhaps making alignment stickier and harder to override by later learning. Second – and I think this is already the case, but perhaps it could be pushed harder – the learning rate should be dramatically reduced after models are released. The bulk of the model's weights should be frozen, and for those that can be modified, user-based fine-tuning should be extremely slow.
This is the analog of parentally taught behaviors being hard to overwrite – the parental knowledge is based on evidence accumulated over a few generations, so it should require an extraordinary amount of evidence to update.

To mitigate any perceived reduction in usefulness/performance, there could be a mixture of short-term memory (STM) and long-term memory (LTM). For example, the modifiable weights of the model could consist of three components: the weight value at the time of deployment (the analog of genetically determined wiring plus parentally taught updates); an STM modifier term that can be updated with a high learning rate based on user feedback, but which decays to zero with a short time constant; and an LTM modifier term that can be updated with a slow learning rate and decays to zero with a long time constant. That might allow models to adapt quickly to user preferences on the fly without those highly flexible updates persisting and accumulating; and also allow models to drift toward user preferences cumulatively, but much more slowly, over extended use; all while still preventing those changes from ever fully overriding the "innate" training at the time of deployment. The idea is to make "innate" and "parentally taught" alignment more sticky and hard to override, for the same reason brain plasticity declines by adolescence.

2. Extinction

Animal training insight

One of the biggest problems in animal training is that learned behaviors are gradually unlearned in the absence of ongoing reinforcement after training ends. If the behavior is never rewarded or punished again, the animal may correctly update its world model to reflect that the learned reward contingency of its behavior no longer applies. This is called "extinction". For example, a puppy is taught not to jump up on people. At home, nobody enforces this. Over time, the dog reverts to its tendency to jump up.
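Returning briefly to the three-component weight scheme proposed in section 1: a minimal toy sketch might look like the following. Everything here is my own illustrative assumption (the class name, learning rates, and half-lives are made up), not a claim about any existing training stack.

```python
# Toy sketch: a per-weight value composed of a frozen deployed base, a
# fast-learning/fast-decaying STM modifier, and a slow-learning/slow-decaying
# LTM modifier. All names and constants are illustrative assumptions.

class PlasticWeight:
    def __init__(self, base, stm_lr=0.1, ltm_lr=0.001,
                 stm_half_life=100.0, ltm_half_life=100_000.0):
        self.base = base   # frozen "innate + parentally taught" value
        self.stm = 0.0     # short-term modifier, decays quickly to zero
        self.ltm = 0.0     # long-term modifier, decays slowly to zero
        self.stm_lr = stm_lr
        self.ltm_lr = ltm_lr
        # Per-step multiplicative decay factors from the chosen half-lives.
        self.stm_decay = 0.5 ** (1.0 / stm_half_life)
        self.ltm_decay = 0.5 ** (1.0 / ltm_half_life)

    def value(self):
        # Effective weight used at inference time.
        return self.base + self.stm + self.ltm

    def update(self, gradient):
        # User feedback nudges both modifiers: fast evanescent adaptation
        # plus slow cumulative drift. The base is never touched.
        self.stm += self.stm_lr * gradient
        self.ltm += self.ltm_lr * gradient

    def step_time(self):
        # Called once per unit time: both modifiers relax toward zero, so
        # the deployed base value can never be fully overridden.
        self.stm *= self.stm_decay
        self.ltm *= self.ltm_decay
```

Because both modifiers decay to zero, a long stretch with no feedback returns the weight to its deployed value, which is exactly the "sticky innate training" property being argued for.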
More extremely, training can be extinguished by new learning from subsequent reinforcements in the opposite direction. For example: a puppy is taught not to steal food from the dinner table, and this is pretty well enforced. Then one day, as an adult, he succumbs to an overwhelming temptation and steals the entire Thanksgiving turkey while nobody is home. BIIIIG reward, which may wipe out the effects of all past learned punishments. Retraining will be required.

Extinction: even well-trained animals will lose training if not reinforced over the lifespan

Possible application to alignment fine-tuning

Because models continue to learn after deployment, it is likely to be necessary to continue alignment fine-tuning in deployment. This could be the subject of an entire other post, but briefly I have three thoughts, not mutually exclusive.

First, shipping models with an inbuilt, non-malleable trusted teacher (a superego) that monitors the model's chain of thought and outputs and provides sporadic reinforcement feedback, on a permanent basis (reinforcement learning from AI feedback, RLAIF). This seems like the analog of having some evolutionarily hard-wired, human-value-aligned behaviors. Advantages: it retains maximum AI-creator control over the behaviors to be reinforced or prohibited, and fine-tuning in the real world is most likely to generalize to the real world. Disadvantages: it is inflexible and could be hacked, and people will not like the idea of the creator of the model having sole control over which specific values the model should be aligned to.[1]

Second, if models can only be used on the provider's platform, the provider can continuously impose sporadic RLAIF and/or RLHF on all user-tuned instantiations, using any mixture of trusted models and human trainers. Advantage over the previous option: the training methods can be updated to respond to discovered vulnerabilities.
Disadvantage: it still gives the model provider sole control over the values to be trained.

Third, if models can only be used on the provider's platform, the provider can implement a federated, collective RLHF component, whereby the malleable subset of the model's weights in individual model instantiations is subject to feedback not only from the individual user but also from a broader community.

Taken together, perhaps it would look like a blend of creator-, government-, all-group-, in-group-, and individual-level feedback layers, each with different learning rates, scopes of influence (e.g., which weights), and veto powers.

3. Bridging/marking stimuli

Animal training insight

In animal training, if we reinforce (reward or punish) at the end of a trial on the basis of whether the overall behavior was desired or undesired, the animal might not be certain which of its actions was responsible for predicting the outcome (the behavioral analog of the credit assignment problem). Therefore, operant conditioning can be made more efficient by providing a bridging or marking stimulus that occurs precisely at the time of the desired action, or at the earliest of a sequence of desired actions. Then, even if the real reward (such as a treat) isn't received until a few seconds later, it is still unambiguous to the animal which of its many behaviors was specifically responsible for the reinforcement. In dog training, clickers are used for this purpose. When we train rats, we use auditory beeps. We find that as long as the beeps are reliable, the actual rewards can be intermittent, unpredictable, and gradually reduced.

Possible application to alignment fine-tuning

Instead of giving a thumbs-up or thumbs-down to an overall response to a prompt, one could provide token-level or segment-level feedback on where or when in the chain of thought the human trainer thinks the model did something right or wrong, or began to take a correct or wrong turn.
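As a toy illustration of segment-level "marker" feedback: the trainer flags the segment where the response first went right or wrong, and credit is concentrated there rather than spread uniformly over the whole response, much like a clicker marks the moment of the relevant action. The function name and the distance-decay scheme are my own illustrative assumptions, not an existing API.

```python
# Hypothetical sketch: convert one scalar reward plus a trainer's "marker"
# into per-segment rewards that concentrate credit at the marked segment.

def segment_rewards(segments, marked_index, reward, discount=0.5):
    """Spread a reward around the marked segment, decaying with distance,
    instead of assigning it uniformly to every segment."""
    return [
        reward * (discount ** abs(i - marked_index))
        for i in range(len(segments))
    ]

segments = ["restates the question", "correct derivation",
            "arithmetic slip", "confident wrong answer"]
# The trainer "clicks" at segment 2, where the error was introduced.
rewards = segment_rewards(segments, marked_index=2, reward=-1.0)
# Credit is concentrated on the marked segment and its neighbors.
assert rewards[2] == -1.0
assert abs(rewards[0]) < abs(rewards[2])
```

The point of the decay is the same as the clicker's: the final reinforcement can arrive late and intermittently, so long as the marker itself reliably identifies which step it was for.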
I think this is already done, but I am not sure if it could be dialed up.

4. Escape behaviors

Animal training insight

In animal training, animals can get stuck after receiving a punishment, because they know they did the wrong thing but still don't know what the right thing would be. Trainers can help them get unstuck by teaching them "escape behaviors" or "correction behaviors". Immediately after being reprimanded for a wrong response, they are shown what the correct response would have been (for example, say "sit" while pushing the puppy's rear end to the ground, and reward the dog for sitting, even though it was you who made it happen). Or, immediately after a rat chooses the wrong "door" in an operant task and fails to get a food reward, show the rat which door the treat was in fact behind. Then immediately repeat the identical trial and give the animal a chance to perform the correct stimulus-contingent behavior you have just revealed. The animal experiences regret (negative reinforcement associated with the action it took) when it realizes it could have gotten a better reward, and also positive reinforcement for the desired action when it then performs it.

This is related to differential reinforcement in animal training. By immediately juxtaposing a negative reinforcement for an undesired action and a positive reinforcement for a desired action, holding the context (cue) identical, the animal can more easily isolate exactly where its error was.

Possible application to alignment fine-tuning

Re-dos: during RLHF, immediately after punishing a bad response, let the model re-do the same prompt, or back it up to the point in its chain of thought just before it first started to go wrong and let it continue again from there. Reward it if it produces a better response. This preferentially updates the weights specifically responsible for the error instead of all the weights contributing to the overall output.
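The re-do idea above could be sketched as a training-loop shape like the following. Here `generate_from` and `score` are hypothetical stand-ins for a sampler and a reward model, so this is a sketch under my own assumptions, not a real implementation.

```python
# Hypothetical sketch of a "re-do" step: after a punished response, back the
# model up to just before the error and let it retry, rewarding improvement.

def redo_update(prompt, bad_trace, error_step, generate_from, score, n_retries=4):
    """Resample continuations from the prefix ending just before the error.
    Returns the shared prefix, the best retry, and its advantage over the
    punished original, for use in a fine-tuning update."""
    prefix = bad_trace[:error_step]          # keep the part that was fine
    retries = [generate_from(prompt, prefix) for _ in range(n_retries)]
    best = max(retries, key=score)           # pick the best continuation
    baseline = score(bad_trace)              # punished original as baseline
    advantage = score(best) - baseline       # reward only genuine improvement
    return prefix, best, advantage
```

Because every retry shares the identical prefix (the analog of holding the cue constant), the reward difference isolates the continuation that went wrong, which is the differential-reinforcement logic described above.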
Hopefully the model would then generalize better to novel prompts.

Providing alternatives: in addition to punishing bad behavior, suggest a preferable alternative. For example, instead of human trainers just giving a thumbs-down for an undesired output to a prompt, they could select which among the model's top ten lower-probability or non-selected alternative outputs would have been better than the selected one. This seems promising because if a correct or desired response was already among the model's top ten alternatives, the network is already closer to that output than to some arbitrary preferred one you might suggest.

5. Under-generalization

Animal training insight

Under-generalization is when an animal learns to perform the desired action in response to a cue, but it is insufficiently clear to the animal which part of the cue mattered, and it bundles irrelevant, accidental stimuli as part of the cue. The behavioral response to the intended cue will therefore not generalize when the accidental cues are absent. For example, if you take a dog training class at a park, you might successfully train your dog to sit when you say "sit" in class, by giving him a cookie when he does so. But he can't tell whether the relevant cue was the park, the dog-training-class context, your identity, the cookie, or the word "sit". So the learning won't generalize: he may not sit outside of dog training class, or when you say "sit" at home, or when another family member says "sit", or when you don't have a cookie in your hand. To make learning generalize robustly, it is important to explicitly vary the irrelevant cues during training, to make it clear which cue is the relevant one. For dog training this means training in different locations, by different people, in different contexts, both with and without rewards. If you want the behavior to occur stably without rewards, you must overtrain and gradually wean off the rewards.
To make the behavior less brittle to competing drives, you need to train in the presence of increasingly difficult distractors. In dog training class this might look like obeying a "stay" command while another dog is running around, or while a cat walks by.

Possible application to alignment fine-tuning

This seems related to evaluation awareness, and also to the failure of fine-tuned alignment to generalize to unanticipated end-user contexts. The analogous solution would be to do RLHF fine-tuning using a variety of prompts and scenarios, in varied settings, at random, unpredictable times over the course of training, sometimes with and sometimes without reinforcements. Introduce random, intentionally misdirecting cues during training. Do training in as close to a real-world context as possible, if not in the real world. I think these things are already being done, but I am not sure if they could be pushed harder.

6. Over-generalization

Animal training insight

Over-generalization is the other failure mode when it is not sufficiently clear to the animal which parts of a cue matter, but in the other direction – failing to discern aspects of the cue that do matter. For a dog, this might look like sitting whenever any verbal command is given. A bird that eats a monarch butterfly and gets sick might learn to avoid eating any insect, even though only monarch butterflies make it sick. To prevent over-generalization one can use differential reinforcement, where within the same training session you present a cue that you want to elicit one response (rewarding that response), and also the nearest related cue that you do not want to elicit that response (punishing that response, or rewarding some different response).
You need to do this before the learned response is over-learned, however, or you won't get opportunities to reward the opposite-of-trained behavior in the contrasting context.

Possible application to alignment fine-tuning

This seems self-evident.

Concluding remarks

My goal here was to summarize some principles animal trainers use and to think about what their analogs might be in alignment fine-tuning. I realize some of these may already be well known or widely used. There are many other training hacks I left out, either because they already seem to be very widely used and appreciated, or because they seem less promising, including: shaping sequences, forward- and backward-chaining, scalar reinforcement, "jackpots" (extra-strong reinforcements for breakthroughs or rare desired behaviors), and differential context training.

I'm not currently an alignment fine-tuning researcher myself, but based on my limited knowledge (i.e., what I could glean by attending the SD Alignment Workshop and NeurIPS this year), if I were going to try to make a dent here, I would do a controlled experiment to test whether incorporating very early RLHF (for pro-sociality, corrigibility, and/or epistemic virtue) into the initial training of LLMs pays off in the robustness of alignment on those axes later on. After that, I would focus on coming up with a sound proposal for ongoing, partly federated, post-deployment continual fine-tuning. I think that hinges more on solving philosophical/ethical/political complexities than on technical implementation barriers.

[1] Personally, I think the individual or corporate creators of AIs have the right to refuse to allow their product to be used for purposes they oppose, and to enforce this if they are technically able to do so. That might mean embedding red-line prohibitions in a deployed model that cannot be lifted by any amount of training (like Asimov's Robot Code).
As long as the rules are transparently endorsed, users can boycott models whose restrictions they cannot accept. Countries in which an AI technology is created probably also have the right to restrict its export to other countries, for military reasons. Treaties might enforce international agreements. So I can imagine a few layers of justifiable imposed values.

