
You’re absolutely right, Senator. I was being naive about the political reality.

Epistemic status: a pattern I keep seeing in my work. I build pipelines where LLMs generate formal assertions from natural-language specs, and I think a lot about what happens when we knotify [1] loops between human intent and machine output. Confidence in the observation is high; confidence in the proposed framing is medium.

LLMs encode simplified human models by compressing large amounts of human-produced text into lower-dimensional approximations of "what humans think like". People then integrate AI outputs as their own positions, especially when the output is genuinely well constructed and confirms their priors. People in governance positions are doing it (sometimes on camera), many are watching, and nobody is building a breaker.

This builds a loop that constrains human complexity (irreducible) into complicated models (lots of moving parts, in principle reducible).

The loop worries me partly because humans are already bad at recognizing value in the first place. Imagine for a moment the internals of a human deciding to change a name such as Department of Defense to Department of War (now proudly hosted at war.gov). I'd bet some misfiring of internals happened there, and if the felt sense of good can misfire at that scale, it can misfire anywhere [2].

I'm not sure how common or widespread this is, but I've heard "even AI agrees" a non-zero number of times in my social bubbles. If we take a system's output and use it as apparent objectivity, I'd at least wish we did it better [3].

The alignment community has proposed circuit breakers at the model level: constitutional AI, scalable oversight, mech-interp-based monitoring, all attempts to ensure the model behaves well. But the failure mode I'm describing doesn't require the model to behave badly. The model can be perfectly well calibrated, honest, and non-sycophantic by the subset of metrics we manage to set on it, and the loop still forms. Here's why I think that is:

- Sycophancy can be a quasi-property of the medium. If every output reads like it was written by a smarter version of yourself, you may integrate it as a self-generated thought whether or not it technically disagrees on specifics.
- Even if the model flags uncertainty or disagreement, the user curates what they present. "AI helped me draft this" becomes "Analysis shows that", and questions like "Was this vibecoded?" get answered with "Less than 50%, and only where the code was too bad to go through by myself [4]". What model-level interventions prevent this kind of repackaging?
- Scalable oversight is designed for scenarios where the AI is the threat. But what about the cases where the human and the AI are co-producing the failure? The human wants confirmation; these systems provide it; institutions reward decisiveness. Oddly aligned.

I work in a job that's supposed to replace humans with AI. I'm part of the problem, though I spend more of my thinking power on figuring out where humans must remain part of whatever process we're trying to automate. I deal with the gap between verification (did we build the thing right?) and validation (did we build the right thing?) [5]. In this gap, I try to model explicitly how humans are needed for grounding relative units of AI output.
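To make that grounding concrete, here's a minimal sketch in Python (hypothetical names, not my actual pipeline) of the kind of gate I mean: an LLM-generated assertion carries its machine verification result separately from human validation, and nothing counts as done on verification alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UNDERDETERMINED = "underdetermined"  # machine checks ran, but no human judgment yet
    VALIDATED = "validated"              # a human decided it is the right thing
    REJECTED = "rejected"


@dataclass
class GeneratedAssertion:
    """An LLM-generated formal assertion derived from a natural-language spec."""
    spec_text: str
    assertion: str
    verified: bool = False               # verification: did we build the thing right?
    reviewer: Optional[str] = None       # validation: did we build the right thing?
    status: Status = Status.UNDERDETERMINED

    def record_verification(self, passed: bool) -> None:
        # Machine-checkable step: the assertion compiles, model-checks, or passes tests.
        self.verified = passed

    def human_judgment(self, reviewer: str, accept: bool) -> None:
        # The gate: nothing leaves UNDERDETERMINED without a named human on record.
        self.reviewer = reviewer
        self.status = Status.VALIDATED if accept else Status.REJECTED


def ready_to_ship(a: GeneratedAssertion) -> bool:
    # Verification alone is not enough; validation requires a human reviewer.
    return a.verified and a.status is Status.VALIDATED and a.reviewer is not None
```

The design choice worth keeping, whatever the real implementation looks like, is that the gate is structural: it doesn't depend on the model behaving well, it depends on a named human applying judgment.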
As of today, the sensible take is that AI outputs remain underdetermined in quality until a human applies judgment.

The alignment community has spent enormous effort on the question "what if AI doesn't do what we want?" I think we need equal effort on the complementary question: what if AI does exactly what we want, and that's the problem? I see us sliding towards self-fulfilling prophecies, and I'm wondering: how do we break out?

Eager to be made lesswrong.

[1] By knotify I mean a feedback loop that ties itself into a structure that's too spaghetti to untangle easily.
[2] Another example of misfiring happened during the agreements with the DoW.
[3] I'm under the impression that "better" currently involves formalization of the mathematical kind. I see its breaking points. If it's not the path, it's at least one of the better paths towards it.
[4] Heard that one this week in a meeting.
[5] I also expand it in a mutually thriving direction, where I keep track of "do we build the good thing?", with a metric that accounts for externalities across agents (self x others) and time horizons (now x future).

