Opinion A Tale of Three Contracts AI News Team March 4, 2026 The attempt on Friday by Secretary of War Pete Hegsted to label Anthropic as a…
Opinion An Alignment Journal: Coming Soon AI News Team March 4, 2026 tl;dr We’re incubating an academic journal for AI alignment: rapid peer-review of foundational Alignment research…
Opinion Current activation oracles are hard to use AI News Team March 4, 2026 This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. tl;dr: Activation…
Opinion Question: Why is the goal of AI safety not ‘moral machines’? AI News Team March 3, 2026 There is a basic question that has been confusing me for a while that I…
Opinion An Age Of Promethean Ambitions AI News Team March 3, 2026 In a recent post, I wrote the following: [Without] the context of history, we're blind to…
Opinion White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5 AI News Team March 3, 2026 Over the last month I have been trying to see just how much I can…
Opinion I Had Claude Read Every AI Safety Paper Since 2020, Here’s the DB AI News Team March 3, 2026 Click here if you just want to see the Database I made of all[1] AI safety…
Opinion Constitutional Black-Box Monitoring for Scheming in LLM Agents AI News Team March 3, 2026 Paper: https://arxiv.org/abs/2603.00829 Thread: https://x.com/syghmon/status/2028878121051496674 Executive Summary: Black-box monitors can detect scheming in AI agents using only externally observable…
Opinion In-context learning of representations can be explained by induction circuits AI News Team March 3, 2026 This is a crosspost of my ICLR 2026 blogpost track post. All code and experiments…
Opinion Single Direction vs Low-Rank Refusal in Small LLMs AI News Team March 3, 2026 Introduction: I recently came across an Alignment Forum post that showed refusal behaviors in LLMs can…