
Four Downsides of Training Policies Online

Published on January 4, 2026 3:17 AM GMT

In order to control an AI model’s worst-case performance, we need to understand how it generalizes to situations where it hasn’t been trained. It seems plausible that powerful AI models will fake alignment and then generalize poorly, causing a catastrophe before training removes this behavior from the AI. A pretty key part of the hope for diffuse control is the fact, discussed by Paul Christiano here, that if models are trained online, then their average-case performance should be acceptable. Thus, online training is seen as a key technique in diffuse control.
In this short post, I’d like to bring up four downsides of training policies online (these downsides are less relevant to online training of monitors). I don’t think these downsides are a knockdown argument against online training; on the contrary, I expect some kind of online training to be desirable. However, these considerations might inform what exactly the online training that we do should look like.
I’ll abbreviate “training policies online” to “TPO”.
Thanks to Eric Gan for telling me about consideration 1. I heard about considerations 2-4 from Ryan Greenblatt’s writing. Thanks to Alexa Pan for running a writing exercise that inspired me to write this short post.
Considerations against TPO
1. If you do TPO, you can’t always have audited the deployed model
TPO means you don’t get to audit the model that’s actually deployed. It is currently a somewhat common practice for AI companies to do evaluations of their model’s behavior before external deployment. In the future, models will hopefully be audited prior to internal deployments and during training (and some AI labs might already do this). However, if we do TPO, the model changes over the course of the deployment.
There are two particularly concerning ways that models might change:

Capability jumps: sometimes, models suddenly gain new capabilities during training, when they grok a new task. This means that stale evaluations might substantially underestimate the deployed AI’s capabilities.
Alignment drift: Plausibly, additional training could mess with the AI’s motivations in some way. This might be effectively “random”, or might be due to, e.g., the model encountering novel philosophical considerations. In the “audit, freeze, deploy” framework, we might have caught the misaligned model. But with TPO, we may be caught unaware.

Online training isn’t the only way that we could have problems like alignment drift—alignment drift could also arise via models recording information in memory banks, for instance. However, training might have an outsized influence on model propensities relative to other vectors for drift, and might be harder to monitor for than other types of drift (e.g., than drift caused by writing stuff down in a memory bank).
Note that we could try to recover most of the benefits of “audit, freeze, deploy” by periodically auditing the online-trained policy, and possibly by making the online training slightly less online (e.g., only updating the model once every two weeks rather than once per hour).
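As a rough sketch of what that could look like: batch the online updates, and gate each update behind an audit of the frozen candidate checkpoint. Everything here (`Policy`, `collect_episodes`, `audit`, `train_step`, the two-week interval) is a hypothetical stand-in, not a real training API:

```python
# Sketch of "slightly less online" training: deployment data accumulates
# continuously, but weight updates land only periodically, and each frozen
# candidate is audited before it replaces the deployed model.

UPDATE_INTERVAL_HOURS = 24 * 14  # update once per two weeks, not per hour

class Policy:
    def __init__(self, version=0):
        self.version = version

def collect_episodes(policy, hours):
    # Stand-in for gathering on-policy deployment data.
    return [{"policy_version": policy.version} for _ in range(hours)]

def audit(policy):
    # Stand-in for a behavioral/alignment audit of a frozen checkpoint.
    return True

def train_step(policy, episodes):
    # Stand-in for applying the accumulated training updates.
    return Policy(version=policy.version + 1)

def run(total_hours=24 * 28):
    deployed = Policy()
    buffer = []
    for hour in range(total_hours):
        buffer.extend(collect_episodes(deployed, hours=1))
        if (hour + 1) % UPDATE_INTERVAL_HOURS == 0:
            candidate = train_step(deployed, buffer)
            if audit(candidate):  # audit the frozen candidate before deploying
                deployed = candidate
            buffer = []
    return deployed
```

The point of the structure is just that every set of weights that gets deployed was, at some instant, a frozen artifact that an audit could have examined.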
2. TPO might sweep problems under the rug.
TPO makes the model more conservative about taking bad actions. TPO might also teach the model in which situations we are capable of catching (or blocking) bad behavior, which might help a misaligned model adopt the strategy of only attempting sabotage where we can neither detect it nor block it.
This seems pretty bad because, conditional on the model being misaligned, we’d ideally like to know this as quickly as possible.
3. TPO might break activation monitors.
TPO means that sabotage data from the past becomes off-policy, which makes it harder for us to train good activation monitors. TPO seems especially likely to break activation monitors in situations where we are leveraging those monitors to help label the rewards that the AI model’s actions should get.
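A toy illustration of the off-policy problem (not real activations; all distributions are made up): fit a simple threshold monitor to the old policy’s “activations”, then evaluate it after further training has shifted the whole activation distribution.

```python
# Toy model: a monitor fit to the pre-TPO policy's 1-D "activations"
# degrades when online training shifts that distribution.
import random

random.seed(0)

def sample(mean_benign, mean_sabotage, n=2000):
    # Sabotage examples sit above benign ones on average.
    xs, ys = [], []
    for _ in range(n):
        if random.random() < 0.5:
            xs.append(random.gauss(mean_sabotage, 1.0)); ys.append(1)
        else:
            xs.append(random.gauss(mean_benign, 1.0)); ys.append(0)
    return xs, ys

def fit_threshold(xs, ys):
    # Pick the threshold that best separates the classes on training data.
    best_t, best_acc = None, -1.0
    for t in [i / 10 for i in range(-50, 100)]:
        acc = sum((x > t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(t, xs, ys):
    return sum((x > t) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Monitor trained on the pre-TPO policy's activations...
t = fit_threshold(*sample(mean_benign=0.0, mean_sabotage=3.0))
acc_on_policy = accuracy(t, *sample(mean_benign=0.0, mean_sabotage=3.0))
# ...evaluated after TPO has shifted the whole activation distribution.
acc_off_policy = accuracy(t, *sample(mean_benign=2.0, mean_sabotage=5.0))
```

In this toy, the sabotage/benign gap is unchanged after the shift; the monitor breaks purely because its threshold was calibrated to the old policy’s distribution.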
4. TPO might harm desirable model capabilities.
It might be the case that we are really bad at evaluating whether a given research effort is good, so doing online training against our reward signal for “good research” might degrade model performance relative to the alternative of training the model to perform well on some other task and relying on generalization.
Conclusion
TPO seems like it might be important for safely getting possibly misaligned AI models to automate AI safety research. However, it seems good to employ the following strategy (h/t to Alex Mallen and Julian Stastny for telling me about this strategy):

Maintain both a copy of the model that has been trained online, and a copy of the model that hasn’t been trained online.
Deploy both models, and try to get a sense of which model performs better. It might be helpful to compare the two models’ outputs on the same tasks. The downsides of TPO don’t apply to the model that hasn’t been trained online. Note that in order to get a sense of which model is performing better, you should account for the fact that you may not be good at measuring performance (as discussed in consideration 4).

Deploying the two models in parallel all the time increases costs by 2x, but hopefully a small amount of the “deploy both models” approach suffices to get a sense of which technique is working better.
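One toy way to see what the comparison involves: with a noisy quality signal (per consideration 4), distinguishing the online-trained model from the frozen one takes many paired tasks. The quality numbers and noise level below are made up for illustration:

```python
# Paired comparison of the online-trained and frozen models under a
# noisy quality metric. All parameters are illustrative.
import random
import statistics

random.seed(1)

def measured_quality(true_quality, noise=1.0):
    # Our quality signal is only a noisy proxy for true research quality.
    return true_quality + random.gauss(0.0, noise)

def compare(n_tasks, q_online=0.55, q_frozen=0.50):
    # Route n_tasks to both models and score each pair.
    diffs = [measured_quality(q_online) - measured_quality(q_frozen)
             for _ in range(n_tasks)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / n_tasks ** 0.5
    return mean, sem  # crude estimate of which model is ahead, and how sure
```

With only a handful of paired tasks the noise term dominates the 0.05 quality gap; with thousands of pairs the gap becomes resolvable, which is some reason for optimism that a small fraction of dual-deployed traffic is enough.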