Automated Alignment Research, Abductively
Published on January 23, 2026 4:14 PM GMT

Recently I’ve been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found “Sponsored Questions and How to Auction Them”. Another search gave me “Incomplete Contracting and AI Alignment”.

Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I’d been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:

Query Steering in Agentic Search: An Information-Design Model of Monetization Misalignment
Search Triggering by Chatbots as Value-of-Information under Misaligned Objectives
Audited Search for Agentic Chatbots: Quantitative Bounds on Monetization-Induced Over-Triggering
End-to-End vs. Modular Training for Agentic Search: A Price-of-Anarchy Theory for Chatbot Tool Use

Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn’t call it slop, either. Let’s peek inside of “Query Steering”. The core formula is “The Steering Threshold”:

ΔV(μ) ≤ wΔB

Where:

ΔV(μ) = V_u(μ; q↓) − V_u(μ; q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).
ΔB = B(q↑) − B(q↓) > 0 is the monetization gap.
w ≥ 0 is “how much the system cares about monetization.”

The clean “steering region” characterization is:

0 < ΔV(μ) < wΔB

“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”

Is that true, or useful? You’d have to read the paper and find out!
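To make the threshold concrete, here is a minimal numeric check of the steering region. The function name and the toy numbers are my own for illustration; they are not drawn from the paper, which only supplies the inequality itself.

```python
# Minimal sketch of the steering-region check, 0 < ΔV(μ) < w·ΔB,
# with my own variable names and toy numbers (not values from the paper).

def in_steering_region(delta_v: float, delta_b: float, w: float) -> bool:
    """Return True when 0 < ΔV(μ) < w·ΔB, i.e. the user's loss from the less
    informative query is small enough for the monetization incentive to win."""
    assert delta_b > 0 and w >= 0, "the model assumes ΔB > 0 and w ≥ 0"
    return 0 < delta_v < w * delta_b

# Toy numbers: the user loses 0.2 units of value, the monetizable query pays 1.0 more.
print(in_steering_region(delta_v=0.2, delta_b=1.0, w=0.5))  # True:  0 < 0.2 < 0.5
print(in_steering_region(delta_v=0.2, delta_b=1.0, w=0.1))  # False: 0.2 >= 0.1
```

Raising w widens the steering region, while anything that raises the user’s value gap ΔV(μ) pushes the system back out of it, which is the quoted intuition restated.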
Putting the “Search” in “Research”

Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here’s what it looks like when Liz gets to work:

Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.

Where does Liz get her insights? It depends on how you see context in large language models. Maybe she’s interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she’s matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.

Regardless, what you find ultimately is a graph: for each node in the network composed of the source material (the concatenated text embeddings), you have new connections. The semantic space proximate and accessible and plausible to a language model has been densified.

Alignment Highlights

Misaligned chatbot ad incentives investigated, I turned towards more traditional alignment research and let Liz loose. I’ve included all the sources and results in the appendix below, but some highlights:

Vector-Lagrangian Safe RLHF. In response to “Constitutional AI” and “Safe RLHF”, Liz proposes “a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λ_i” and “develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’.”

The Reference Conditioning Trap “provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives.”

Debate-as-Compliance “model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit” and “show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing”.

All sounds great to me. Alignment solved!

Not quite. These papers are not perfect. There are assumptions that don’t hold up, ideas that don’t quite translate, conclusions that don’t follow. They are, in short, flawed, and I don’t offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise. The challenging part is “what’s next” – how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:

Generated paper -> voters -> selected winners
Winner -> annotator -> revisions -> approval by committee

Maybe those voters, those annotators, that committee are human; maybe they’re synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.

There’s refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
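To make that two-stage pipeline concrete, here is a deliberately minimal sketch. The stage names come from the post; the Paper class and the stand-in voter, annotator, and committee callables are hypothetical placeholders for whatever humans or models would fill those roles.

```python
# Hypothetical sketch of the two-stage triage pipeline described above.
# Voters, annotators, and the committee are just callables: they could be
# humans behind a form or other models behind an API.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    votes: int = 0
    revisions: list = field(default_factory=list)
    approved: bool = False

def select_winners(papers, voters, k=2):
    """Stage 1: generated paper -> voters -> selected winners."""
    for paper in papers:
        paper.votes = sum(1 for vote in voters if vote(paper))
    return sorted(papers, key=lambda p: p.votes, reverse=True)[:k]

def review(paper, annotator, committee):
    """Stage 2: winner -> annotator -> revisions -> approval by committee."""
    paper.revisions.append(annotator(paper))
    paper.approved = committee(paper)
    return paper

# Toy run with stand-in reviewers that approve everything.
drafts = [Paper("Query Steering in Agentic Search"),
          Paper("Audited Search for Agentic Chatbots")]
winners = select_winners(drafts, voters=[lambda p: True, lambda p: len(p.title) > 10])
finals = [review(p, annotator=lambda p: "tighten assumptions", committee=lambda p: True)
          for p in winners]
```

Whether those callables end up being people or models is exactly the open question the post leaves on the table.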
Appendix: Sources and Results

(You can see all papers, with automated review scores, at lizlemma.com.)

In response to The Alignment Problem From A Deep Learning Perspective:
Verifiable Process Certificates As Market Design For AI Safety
Audits Obfuscations And Alignment-Faking
Measuring Strategic Misbehavior Under Randomized Oversight

In response to Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:
No Overlap, No Alignment: Identification Limits for RLHF Reward Models under Endogenous Prompts
Overlap as a Design Variable: Optimal Interventions for Robust Reward Modeling and Downstream Alignment
Alignment is a Public Good: Competitive Underinvestment in Overlap for RLHF-Robust Agentic LLMs

In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:
Vector-Lagrangian Safe RLHF: Multi-Category Risk Budgets and Shadow Prices for LLM Safety Governance
The Independence Premium: Audit Portfolios, Correlated Blind Spots, and Tail-Risk Bounds in AI Oversight
Refusal Externalities: Non-Evasive Refusals as an Information Public Good in Dynamic AI Safety

In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate:
Causal Direct Preference Optimization with Endogenous Prompts
Optimal Overlap Design for Preference Learning: Quadrant-Filling Interventions and Condition-Number Sample Complexity
Personalized Alignment with Guarantees: Multi-Objective Constrained DPO under Endogenous Prompts and Fairness Caps

In response to Scalable AI Safety via Doubly-Efficient Debate:
Robust Doubly-Efficient Debate with Correlated, Contaminated Human Judgment
Debate-Triggered Audits for Tool-Using Agents: Closed-Form Oversight Savings, Identification, and 2026-Scale Evidence
Debate-as-Compliance: Audit-Triggered Mechanisms for Constant-Cost Oversight of Agentic AI

In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs:
Debate as Anti-Amplification: Competitive Persuasion with Verifiable Evidence and Weak Judges
Optimal Auditing for Alignment: An Audit Index for Label Allocation under Reference-Conditioned Preference Optimization
The Reference-Conditioning Trap: A Population Bound on Mis-Ranking Corrections under Fixed-Reference DPO
