Agenda Reflection: Testing Automated Alignment
Published on February 23, 2026 9:53 PM GMT

Sharing a mostly inactive empirical agenda from mid-2025 (a companion piece to a Techgov paper). I'm currently not planning to spend much time on it, but may consider collaborating.

For a while I was curious about Scalable Oversight and AI Control, both because they're cool to work on and because they seem quite important: as we go towards ASI, and absent great theoretical breakthroughs, they're likely the best we'll have.[1] With Jan Leike's recent optimism, and Opus 4.6 being quite good, it finally feels like the right time to post about a paper I wrote on this.

Fairly heavily inspired by Wentworth's slop argument, the question was: how can we tell if automated alignment will fail? Or: once we have controlled models, how do we know if we can actually use them to make progress on alignment?

Main framing:

We will try to automate AI safety work, i.e., have AIs solve the "Superalignment problem", which seems good. But early successes might not leave much margin for error if things suddenly fail.

"Human-level" automated AI researchers aren't immune to failure. Even absent overt scheming, they may not help us sufficiently in figuring out ASI alignment.

The handover of research from humans to near-human-level AI systems might seem fine at first, but it would be easy to get complacent.

Already, we're not very good at catching errors in model outputs. These errors will have higher stakes soon. Can we trust models to know how to safely align ASI?

I wrote a paper trying to formalize this argument and suggest some mitigations:

Testing it: an empirical research agenda to get early hints at failure of automated alignment.
Governance levers: we need to know when to draw the line. Near-handover might be that line.

The paper was mainly governance-oriented, and argued that early success with control/automated alignment work might lead to overconfidence.
I also made a distinction between controlling AI behavior (preventing scheming, deception, and harmful actions) and evaluating the technical correctness of AI outputs. I named the latter problem the discernment gap: the growing difficulty of catching subtle but critical errors in AI-generated safety research.

[Image by Notebook LM's slideshow feature]

Honestly, I am not sure if the discernment term added much to the paper, but it might be a helpful handle on some relevant concepts.

Empirical agenda

In Appendix A of the paper, I proposed a research agenda:[2] a blinded test in which a group of AI safety researchers is given faulty research outputs, or a faulty model, to see if they'd catch the flaws. I think the specific experiment I proposed mostly doesn't work in its current form, for various reasons. But a similar approach might, and might be worth testing.

The idea: if researchers don't find fairly simple flaws, we shouldn't expect them to find more complex ones easily, even without scheming, and even without asking "will this technique generalize to ASI?"

The most similar work is Marks et al., a really cool paper pitting several interpretability teams against a tampered model. That tested interpretability methods, while I'm proposing a more general question: evaluating researchers' ability to judge safety-relevant AI outputs.[3]

What's next?

At this point I'm not seriously working on this project, but I might be open to collaboration if someone has ideas on how to take it forward. Alternatively, I could imagine skipping the proposed experiment and working on mitigations assuming it was successful; e.g., assuming human researchers are quite bad at detecting flaws in model outputs, what's next? Or testing the generalization of LLMs on AI safety work, or something else.

If you are considering working on something like this and haven't been able to find traction, I'm happy to chat.[4] I suspect there is something feasible in this space; we just need to find it.
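As a rough illustration of how results from a blinded flaw-detection study like this could be scored, here is a minimal simulation sketch. Everything here is a hypothetical assumption for illustration (the reviewer counts, flaw rates, and detection probabilities are invented, not from the paper); the point is just the shape of the measurement: show reviewers a blinded mix of clean and deliberately flawed outputs, then compare the flaw-detection rate against the false-alarm rate on clean outputs.

```python
import random

# Hypothetical blinded study: each reviewer judges a mix of clean and
# deliberately flawed research outputs, without knowing which is which.
# All parameters below are illustrative assumptions, not real data.
random.seed(0)

def run_study(n_reviewers=10, n_items=20, flaw_rate=0.5,
              catch_prob=0.4, false_alarm=0.1):
    """Simulate reviewers flagging items as flawed.

    Returns (detection_rate, false_alarm_rate): the fraction of truly
    flawed items that were flagged, and the fraction of clean items
    that were incorrectly flagged.
    """
    catches, flawed_total = 0, 0
    false_flags, clean_total = 0, 0
    for _ in range(n_reviewers):
        for _ in range(n_items):
            is_flawed = random.random() < flaw_rate
            # A reviewer flags a flawed item with prob. catch_prob,
            # and a clean item with prob. false_alarm.
            flagged = random.random() < (catch_prob if is_flawed else false_alarm)
            if is_flawed:
                flawed_total += 1
                catches += flagged
            else:
                clean_total += 1
                false_flags += flagged
    return catches / flawed_total, false_flags / clean_total

detection, false_alarms = run_study()
print(f"flaw detection rate: {detection:.2f}, false-alarm rate: {false_alarms:.2f}")
```

A low detection rate on deliberately simple planted flaws, relative to the false-alarm baseline, would be the kind of early warning sign the agenda is after.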
Acknowledgements

The full list is in the paper, but special mention to Dan Valentine, Robert Adranga, Leo Zovic, Sheikh Abdur Raheem Ali, Evgenii Opryshko, Yonatan Cale, Yoav Tzfati, and Anson Ho for discussions and feedback on the research agenda. The paper, research agenda, and post were not endorsed by, and do not necessarily reflect the views of, the people listed here.

[1] Of course, interpretability is making good progress. But most people working on it don't expect it to be the main tool that aligns ASI.

[2] I also submitted a version of this agenda to the OP RFP back in April 2025, and got some positive feedback (willingness to consider funding this with the right mentor involved). At the time, I didn't have the capacity to pursue it further. Happy to share the application gdoc privately, which contains a bit more detail on the agenda than the paper appendix.

[3] I got tentative positive feedback from Ryan Greenblatt and Daniel Ziegler on the idea; mostly "this is good to test" rather than "this proposal actually tests this well". Note: I only asked for feedback on an early elevator pitch of the project, not the entire proposal.

[4] Though, currently I would not be able to commit more than a few hours/week on this.