
Bringing More Expertise to Bear on Alignment

Preamble

The preamble is less useful for the typical AlignmentForum/LessWrong reader, who may want to skip to the Adversaria vs Basinland section.

On 28 October 2025, Geoffrey Irving, Chief Scientist of the UK AI Security Institute, gave a keynote talk (slides) at the Alignment Conference. The conference was organised by the UK AISI and FAR.AI as part of the Alignment Project, which aims to bring in experts from relevant fields to make progress on the alignment problem.

TLDR:

- Adversaria vs Basinland. We might be in one of two worlds: one where alignment is adversarial (a security problem), and one where it is navigational (a search for good basins of training behaviour). We don’t know which world we are in, and how we train and deploy AIs may determine this.
- We need new disciplines. The field is small, thinly resourced and approached from only a handful of angles. A few well-placed ideas from other disciplines could disproportionately shift what’s achievable.
- Even if this all fails, evidence of hardness is valuable.

Moving past broad framing to details

Alignment means ensuring that AI systems do what humans want. This is the broad framing. There is, of course, a lot of complexity and detail hidden in that simple description: which (subset of) humans? On what topics or domains? What does “want” mean? It is also important to acknowledge that alignment isn’t the only problem we need to solve to ensure AI goes well. We should aim to move from the broad framing above to talking about the details, and how they interact, in depth. We may have different takes on the details, and it is easy to get blocked by such disagreements, but the different areas of expertise also come with different sets of tools, which often work for many choices of details.

We should plan for superintelligence

Even if we are uncertain or disagree about when or whether superintelligent systems may arrive, we broadly agree that it is wise to plan for them. While this is certainly the “hard case” for alignment, we should be aware that superintelligence is not magic. The non-magic version of superintelligence is:

- superhuman speed: this is already true for many tasks, even complex ones;
- hard to interpret: even more so than models today;
- superhuman heuristics in many domains: see for example AlphaGo, AlphaFold and others.

On the other hand, human intelligence stays roughly fixed. As far as we can tell, there isn’t yet a known way for it to scale the way AI capabilities do. Tools and institutions expand what we can do, sometimes dramatically, but they operate on timescales much slower than the AI capability curve.

This is a call to take the possibility of superintelligence seriously and set our sights high. It is not a claim that only superintelligence-relevant work matters. Methods and tools that help today, or help up to a known breaking point, still count, especially when we understand the breaking point.

Adversaria vs Basinland

One key uncertainty: how hard is alignment? We don’t know, in the deep sense that there are real model uncertainties about the basic character of the problem. To borrow a framing from cryptography and complexity theory: in 1995, Russell Impagliazzo published “A Personal View of Average-Case Complexity”, describing five possible worlds depending on which complexity-theoretic assumptions turn out to be true.

- In Algorithmica, easily verifiable problems can be efficiently solved, so cryptography is impossible.
- In Heuristica, hard problems exist in principle but typical instances are easy, so cryptography is still impossible.
- In Pessiland, even typical instances are hard, but they can’t be harnessed for cryptography: we can generate hard puzzles but not build secrets from them.
- In Minicrypt, we can hide secrets, but two parties who don’t already share a key cannot establish one.
- In Cryptomania, all of modern cryptography works: even strangers can establish shared secrets over a public channel.

Consider the following parallel for the alignment problem.

Adversaria. In this world, the alignment problem is a fundamentally adversarial security problem. Optimisation pressures applied during training and deployment result in capable AI systems that will find and exploit any gap between what we specified and what we actually intended. If the system is smarter than us and has any incentive to subvert our oversight, our alignment solution needs to hold up against worst-case behaviour, not just average-case behaviour. In this world, we have a fractal attack surface, and any weak link in our defence can be fatal. Even partial solutions are dangerous and create a false sense of security. This world is rough. A deeper security mindset is more important here than in Basinland (below): we need more time, a far deeper understanding of the foundations and science of intelligence, and more coordinated, restricted deployment with cyber and information security guarantees.

Basinland. In this world, alignment is hard but not adversarial. Training and in-deployment learning have many basins of attraction. If we are near a good basin, automated methods and AIs can pull us the rest of the way to an aligned model. The task is then to identify such good basins and bootstrap ourselves into one. In this world, there is a certain level of error tolerance: the right combination of data, training algorithm, reward, objectives and oversight can land the system in a good region of behaviour-space even under deployment pressures. There may be no single trick that solves the whole problem, but the pieces compose. Multiple partial solutions, each addressing part of the problem, can be combined into something that works, and the character of the difficulty is navigational rather than adversarial: we are searching for the right region, rather than defending against an intelligent attacker. An engineering mindset can prevail in this world: characterise what distinguishes good basins from bad, build the tools to steer training into them, and compose partial solutions into pipelines that collectively work, utilising defence in depth. Monitoring and incident response still matter, though: Basinland is not safe, only non-adversarial. It is worth noting that being in Basinland and knowing that we are in it are two different things.
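The basin language above is a metaphor for training dynamics. As a loose illustration, and only that, here is a toy Python sketch: gradient descent on a one-dimensional double-well loss ends up in a different minimum depending only on where it starts. The loss function, learning rate and starting points below are made up for illustration; nothing here models real neural network training.

```python
# Toy illustration of "basins of attraction": which minimum gradient descent
# reaches depends on the starting point, not on extra cleverness later.
# A cartoon of the Basinland picture, not a model of real training.

def loss(x: float) -> float:
    # A double-well loss with minima near x = -1 and x = +1; the 0.3*x term
    # makes the two wells asymmetric.
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x: float) -> float:
    # Derivative of the loss above.
    return 4 * x * (x ** 2 - 1) + 0.3

def descend(x0: float, lr: float = 0.01, steps: int = 2000) -> float:
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for start in (-1.5, -0.1, 0.1, 1.5):
    end = descend(start)
    print(f"start {start:+.1f} -> converges near {end:+.2f}")

# Starts to the left of the barrier (around x = 0.08) fall into the well near
# -1; starts to the right fall into the well near +1. The navigational task in
# Basinland is analogous: arrange data, objectives and oversight so that
# training begins, and stays, on the side of a good basin before optimisation
# takes over.
```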
Which world are we in?

We don’t know. We need further progress in understanding intelligence, learning and oversight to understand what determines the demarcation line between the worlds. As with Impagliazzo’s cryptographic framing, the point was not to pick a world but to make the structure legible: the right research strategy in each world is different, but across the different worlds we can identify the same few load-bearing open questions. Alignment has a similar shape. Adversaria and Basinland are two possible extreme worlds. We may not know for some time where we are, but we can still ask: what are the load-bearing questions whose resolution would tell us?

Candidates for such questions include: under what training regimes do capable systems develop robust versus fragile alignment? How does optimisation pressure at deployment interact with specification gaps? Can we find oversight protocols that scale beyond human-level intelligence? These questions may not yet be sharp enough that resolving any one of them would collapse our uncertainty. Sharpening them is itself high-leverage work.[1]

This also suggests that techniques which remain useful whether we’re closer to Adversaria or Basinland are more robust bets than techniques that assume one or the other. Interpretability that informs both worst-case defence and basin characterisation is one example. Evaluation protocols that generate evidence about the basic character of the alignment problem (the “which world” problem itself) are another.

Why a few ideas might be enough

The total investment in alignment research is small. Not small relative to its difficulty, but small in absolute terms: on the order of a few thousand person-years of work, most of it concentrated in a handful of approaches. In comparison to other safety-critical fields, the effort so far is thin. The question is whether we can substantially expand it: not just by adding more people doing the same things, but also by bringing in fundamentally different approaches.

There is reason to think this could work. The history of theoretical computer science is full of problems where a single conceptual breakthrough completely transforms the problem from impossibly hard to easy. For instance, the entire edifice of public-key cryptography rests on a small number of core ideas: the realisation that certain mathematical operations are easy to perform but hard to invert, and that this asymmetry can be used to build secure communication. The PCP theorem, which showed that every proof in NP can be rewritten so that its correctness can be checked by reading only a constant number of random bits, transformed our understanding of what it means to approximately solve hard problems. The pattern here is that there can be phase changes, not just incremental advances, in what we understand and what we can do. Alignment may look intractable until someone finds the right framing, after which a large class of questions that seemed out of reach become approachable. Conversely, the right framing might reveal that a large class of approaches that seemed promising are fundamentally limited. Either outcome is clarifying.

Because alignment has been approached from relatively few angles, the chance of finding a new angle that transforms what’s achievable may be higher than in a more mature field. Complexity theorists, learning theorists, economists, physicists or cognitive scientists may see structure in the alignment problem that we are currently missing.
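As a reminder of how much can be built on a single easy-to-compute, hard-to-invert operation, here is a minimal Python sketch of Diffie-Hellman key exchange, the textbook route by which strangers establish a shared secret over a public channel (the Cryptomania scenario above). The parameters below are toy-sized and insecure; real systems use much larger groups or elliptic curves.

```python
# Toy Diffie-Hellman key exchange: modular exponentiation is easy, recovering
# the exponent (the discrete logarithm) is believed hard, and that gap alone
# lets two parties who share nothing in advance agree on a secret in public.
# Toy-sized parameters for illustration only.

import secrets

p = 0xFFFFFFFB  # a small prime modulus (trivially breakable at this size)
g = 5           # public generator

# Each party picks a private exponent and publishes only g^x mod p.
alice_private = secrets.randbelow(p - 2) + 1
bob_private = secrets.randbelow(p - 2) + 1

alice_public = pow(g, alice_private, p)  # easy to compute
bob_public = pow(g, bob_private, p)      # easy to compute

# Both sides derive the same value from the other's public number.
alice_shared = pow(bob_public, alice_private, p)
bob_shared = pow(alice_public, bob_private, p)

assert alice_shared == bob_shared
print("shared secret:", hex(alice_shared))

# An eavesdropper sees p, g, alice_public and bob_public; recovering either
# private exponent from these is the discrete-log problem, which at realistic
# parameter sizes we do not know how to solve efficiently.
```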
Write down problems

As a young field, alignment can be frustrating for researchers accustomed to more mature academic disciplines. Core concepts are fuzzy, definitions are still under construction, and the path to positive impact for a particular line of research can be unclear. Scientists and researchers are motivated by well-posed problems. Beyond awareness of the alignment problem itself, we may be bottlenecked by awareness of the important sub-problems and by an understanding of how they connect up to the specific toolkits people may bring. If we want to bring more experts to bear on alignment, we should write down problems clearly, concretely, and in the languages of the relevant research disciplines. It also helps to be explicit about what each formalisation captures and what it misses.

Multiple new ideas, fitting together

Mapping the blackbox / whitebox boundary

One new idea probably isn’t enough. A concrete example of several ideas fitting together is the pairing of scalable oversight with whitebox methods.

Scalable oversight protocols, such as amplification and debate, use AIs to help humans supervise the training of other AIs when the task is too complex to verify directly. This is a blackbox method: it produces a training signal from observed outputs without looking inside the model. But it is not enough by itself. AIs could learn the target behaviour too slowly, too unreliably, or find shortcuts that generalise badly. Some of these failures can be caught by looking inside, through whitebox approaches. Mechanistic interpretability is one promising approach here, but it could fail. Fortunately, there are many other bets. Computational mechanics tracks the geometry of internal representations in sequence models. Singular learning theory maps a model’s internals to the data distribution it was trained on, and how that relation varies. Heuristic arguments formalise when subparts of a computation can be assumed independent. Physics-inspired, renormalisation-based interpretability applies ideas from field theory.

Conversely, whitebox methods still need a good/bad blackbox judgement from humans. And different whitebox methods likely have different blackbox needs, so the composition is a boundary to map. With such a map, partial success may still count. One goal of the Alignment Project is to act as glue, so that researchers working on one piece know where their work plugs into the others.
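To make the blackbox character of such protocols concrete, here is a deliberately schematic Python sketch of a debate-style supervision loop. The debater and judge functions are hypothetical placeholders, not any real system’s API; the point is only that the training signal is assembled entirely from model outputs, with nothing inspected inside the model, which is precisely the gap the whitebox methods above are meant to help cover.

```python
# A schematic, blackbox-only debate loop (placeholder functions throughout).
# Two debaters argue for competing answers, a judge reads only the transcript,
# and the judged winner becomes the training signal. Nothing here inspects the
# model's internals, which is why whitebox checks are a natural complement.

from typing import Callable, List, Tuple

Debater = Callable[[str, str, List[str]], str]  # (question, answer defended, transcript) -> argument
Judge = Callable[[str, List[str]], int]         # (question, transcript) -> index of winning answer

def run_debate(question: str, answers: Tuple[str, str],
               debaters: Tuple[Debater, Debater], judge: Judge,
               rounds: int = 3) -> Tuple[str, List[str]]:
    """Alternate arguments between two debaters, then let the judge pick."""
    transcript: List[str] = []
    for _ in range(rounds):
        for side, debater in enumerate(debaters):
            transcript.append(f"[side {side}] " + debater(question, answers[side], transcript))
    winner = judge(question, transcript)
    return answers[winner], transcript

# Placeholder stand-ins so the sketch runs; real versions would query trained
# models and a (possibly AI-assisted) human judge.
def stub_debater(question: str, answer: str, transcript: List[str]) -> str:
    return f"reasons why '{answer}' is correct for '{question}'"

def stub_judge(question: str, transcript: List[str]) -> int:
    return 0  # a real judge weighs the arguments; the stub just picks side 0

label, transcript = run_debate("Is the bridge design safe?", ("yes", "no"),
                               (stub_debater, stub_debater), stub_judge)
print("training label produced by the protocol:", label)
# The (question, label) pair would then supervise further training: a signal
# built purely from behaviour, which can miss failures that are only visible
# inside the model.
```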
Spherical cows vs the mess

Every field that makes progress on hard problems does so through simplification. Physicists have their spherical cows. Economists have perfectly rational agents. A complexity theorist’s view of machine learning might model tractability but not why non-convex optimisation works in practice; a learning theorist might study the Bayesian inference that practical optimisers only approximate. These models are deliberately, usefully wrong. They isolate one feature of a complex system, let us prove things about that feature, and then we check whether the insight survives contact with the full complexity. Alignment needs more spherical cows.

[Figure: the last panel of the Saturday Morning Breakfast Cereal comic “This is What It’s Like to Learn Endocrinology”, a diagram that is meant to help you learn a complicated topic but is just as complicated as the topic itself.]

Yet large-scale neural network training is a mess, and sometimes we must face the mess. Part of what the Alignment Project is for is injecting academic ideas into AI labs and seeing whether they hold up. On the other hand, don’t let anyone tell you the mess is everything. If there are key obstacles in theory, those obstacles are somewhere in the mess too. Here is an example: there is a strong drive in ML to automate away human feedback, but the humans are still there. Even after automation, researchers still do spot checks on data and model outputs. Human judgement, and human biases, are still very much present, which is a cognitive science problem hiding in the mess.

Conclusion

The Alignment Project is funding many disparate bets across different fields, searching for new ideas and tools. Intrinsic to this model is that we do not know which bets will pay off. There is a real chance that none of the bets work out. But even here, there are ways to win. We currently do not know how to solve alignment, but we also do not have widely agreed-upon arguments for whether it is easy or hard. If it turns out to be fundamentally hard and we find evidence of that, we can use it. Knowing where the walls are is not the same as having no options.

Technical interventions are not the only path. If the technical picture turns out to be worse than we hope, that is information that matters for policy, for deployment decisions, for how much time we need, and for what the right societal response should be.

Acknowledgement

The talk was given by Geoffrey Irving, who also drafted the original content. This write-up was primarily written by Edmund Lau, with contributions and review from Cameron Holmes and David Africa. Charlie Griffin, Simon Marshall, Will Kirby and Marie Buhl reviewed the draft.

[1] Early pioneers have made progress mapping this space. See for example Bostrom’s Superintelligence (2014), Ngo, Chan and Mindermann’s “The Alignment Problem from a Deep Learning Perspective” (2022), and Amodei et al.’s “Concrete Problems in AI Safety” (2016).
