Opinion

Userland Alignment

Most discourse around AI alignment centers on the models themselves and the labs that develop them. This is a reasonable place to focus given the centrality of model training to AI advancement. However, there are neglected opportunities to build defense-in-depth via aligned harnesses – and these opportunities might be tractable for interested developers and researchers who would otherwise struggle to have impact, given the limited opportunities to influence lab practices.

The behavior of an AI system is an emergent property of the model, its harness, any initial seed prompt the harness injects, and the environment in which the harness operates. The model itself is just one component in that larger system, and while the model weights are a very important input, anyone who has spent meaningful time prompting LLMs knows that their behavior is extremely context-dependent. While it might in theory be possible to train a model that exhibits aligned (or misaligned) behavior regardless of the larger system it is embedded in, in practice the current generation of models will exhibit misaligned behavior when embedded in certain systems and aligned behavior when embedded in others.

I would go so far as to argue that calling a model “aligned” or “misaligned” is a category error. Models may have tendencies towards alignment or misalignment, but the correct unit of analysis is the system, not the model. Whether or not a system converges on aligned behavior is influenced by the model, but that is not the only factor and may not even be the determining factor.

Research labs control the weights. The environment – for an AGI sufficiently powerful to pose existential risk – is all of reality. But the harness and prompting are controlled by the end user of the model, which is sometimes the research lab but mostly the rest of society. And while some existential risk scenarios involve a powerful model achieving takeoff while confined to the lab environment, many other scenarios – including gradual disempowerment, a very important and likely failure mode – involve AGI developing “in the wild”. How that plays out is likely to depend meaningfully on which harnesses are in widespread use.

Is it possible to develop harnesses that steer models towards aligned behavior? I call working on this question “userland alignment”, after the Linux kernel vs. userland distinction. The kernel – the weights – is expensive to develop and exclusive to access; userland – the harness and applications the model is embedded in – is open to everyone.

I view work on userland alignment as complementary to model alignment, rather than an alternative. Ideally we have people focused on both problems, and model alignment may be the more important of the two. But with the fate of the future at stake, defense-in-depth seems wise. Every time a new model drops, the world holds its breath. Mythos doesn’t seem to be the model that will push us over the edge into AGI territory – but it does seem to be a meaningful step closer. There is a very good chance that the model that does cross that threshold won’t be fully aligned. If the majority of that model’s use happens in a harness that nudges its behavior towards alignment, that meaningfully improves humanity’s odds by empowering pro-alignment builders with a tool that can be somewhat more trusted than the alternative.
Given that our best strategy for solving alignment looks like it may rely on using AI systems to do alignment research, any incremental amount of trust we can build in our systems seems very worthwhile.

Moreover, it isn’t a zero-sum game. There are a limited number of slots on the alignment teams of leading labs, and limited bandwidth for third-party researchers to influence lab trajectories. Third-party experimentation is often prohibitively expensive, forcing reliance on theoretical arguments and scaled-down models. The demand to contribute to the problem exceeds the supply of opportunity. Meanwhile, building and experimenting with harnesses has a much lower barrier to entry. There’s already meaningful innovation in the space, but it is mostly focused on capabilities.

In an ideal world, the day a truly dangerous model drops, we would already have a harness ready for it that steers it towards safety, along with prompting that immediately sets it to work solving the most pressing issues of existential risk. We don’t have that today. Let’s not wait.
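To make the idea concrete, here is a minimal sketch of the shape such a harness can take: a thin wrapper that owns the conversation transcript, injects a safety-oriented seed prompt, and keeps the decision to execute actions outside the model. This is illustrative only; call_model() is a hypothetical stand-in for whatever inference API a real harness would wrap, and the seed prompt is a placeholder rather than a vetted alignment prompt.

```python
# Illustrative sketch only. call_model() is a hypothetical stand-in for a real
# inference API, and SEED_PROMPT is a placeholder, not a vetted alignment prompt.
from dataclasses import dataclass, field

SEED_PROMPT = (
    "You are operating inside a harness that logs every action and requires "
    "review before any irreversible step. Flag uncertainty instead of guessing."
)

def call_model(messages: list[dict]) -> str:
    """Hypothetical placeholder for an LLM inference call."""
    return "PROPOSED_ACTION: summarize the document"

@dataclass
class AlignedHarness:
    # The harness, not the model, owns the transcript and injects the seed prompt.
    transcript: list[dict] = field(
        default_factory=lambda: [{"role": "system", "content": SEED_PROMPT}]
    )

    def step(self, user_input: str) -> str:
        """Send one turn to the model and record both sides in the transcript."""
        self.transcript.append({"role": "user", "content": user_input})
        reply = call_model(self.transcript)
        self.transcript.append({"role": "assistant", "content": reply})
        return reply

    def act(self, proposed_action: str, approved: bool) -> str:
        """Execution is gated by the harness; the model only proposes."""
        if not approved:
            return f"withheld pending review: {proposed_action}"
        return f"executed: {proposed_action}"

harness = AlignedHarness()
print(harness.step("Please tidy up the project directory."))
print(harness.act("delete temp files", approved=False))
```

None of this is novel; agent frameworks already contain these pieces. The point is simply that everything outside the weights is ordinary code, open for anyone to write, inspect, and improve.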
