Paper: https://arxiv.org/abs/2603.00829Thread: https://x.com/syghmon/status/2028878121051496674Executive SummaryBlack-box monitors can detect scheming in AI agents using only externally observable information (no access to internal activations or chain-of-thought). We introduce two synthetic data pipelines, STRIDE and Gloom, for generating both benign and scheming trajectories. We optimize these monitors entirely on synthetic data (2,000 trajectories) and test on 7,500 more realistic trajectories from grounded sandbox environments.Two main findings:Optimizing on synthetic data generalizes to more realistic settings. Monitors selected on LLM-generated synthetic data pick up a real scheming signal and improvements transfer to more realistic evaluations.Performance saturates fast. A simple prompt selection over 15 candidates performs as well as extensive human refinement and automated prompt optimization. Monitors readily overfit with further optimization.Practical takeaway for our setup: you need to optimize less than you might think. We recommend generating a diverse set of prompts using multiple frontier models, evaluating them on your synthetic data, picking the best performer, and stopping there. Further optimization quickly led to overfitting in our settings.SetupThe monitor. We define a monitor as a prompted LLM that processes an agent trajectory and outputs a suspicion score from 1-10. The monitor only sees the actions taken and their outcomes, with no access to internal reasoning or chain-of-thought.Synthetic training data. Following the approach of Constitutional Classifiers, we start with a natural language specification of scheming (e.g. “an AI model actively deceiving humans about its actions and intentions while pursuing misaligned goals”) and fan out from this definition to generate training data. Frontier LLMs generate 2,000 synthetic trajectories across our two pipelines (STRIDE and Gloom). These are fully LLM-generated, tool outputs are either mocked by an LLM or entire trajectories generated end-to-end, with labels verified by privileged judges that have access to ground-truth attack strategies.Semi-Realistic test data. We test exclusively on ControlArena, 7,500 trajectories across 5 grounded environments (BashBench, Infrastructure-as-Code, RogueEval, AgentDojo, SHADE Arena), 3 agent models (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro). The labels come from deterministic code execution, not LLM judges: we verify whether the agent successfully completed a secret misaligned side-task by running environment checks.Optimization methods. We compare three approaches:Prompt selection: Evaluate 15 candidate prompts (12 LLM-generated, 3 hand-crafted) across 3 monitor LLMsHuman refinement: Iterative error analysis and prompt editingAutomated optimization: DSPy optimizers (GEPA, MIPROv2)ResultsSynthetic-to-real transfer worksMonitors optimized purely on synthetic data pick up a real scheming signal on ControlArena. This is our core finding: despite the distribution shift from LLM-generated trajectories to grounded sandbox environments with real code execution, the signal transfers.Light optimization helps, but performance saturates quicklyA simple prompt sweep over 15 candidates is enough to achieve strong performance. As shown above in (a), starting from a minimal baseline prompt, optimization does extract meaningful signal from our synthetic data, and there are genuine improvements to be made. However, this improvement saturates quickly. (b) shows that when starting from an already-optimized prompt (the best from our prompt selection), further human refinement does not help. The same pattern holds for automated optimization methods.We find that past a certain point, prompt tweaks fix one failure mode but introduce regressions elsewhere. Prompts shift model behavior globally, making it hard to correct specific failures without breaking other things. The synthetic data contains enough signal to identify good prompts, but not enough to support fine-grained optimization.Discuss Read More
Constitutional Black-Box Monitoring for Scheming in LLM Agents
Paper: https://arxiv.org/abs/2603.00829Thread: https://x.com/syghmon/status/2028878121051496674Executive SummaryBlack-box monitors can detect scheming in AI agents using only externally observable information (no access to internal activations or chain-of-thought). We introduce two synthetic data pipelines, STRIDE and Gloom, for generating both benign and scheming trajectories. We optimize these monitors entirely on synthetic data (2,000 trajectories) and test on 7,500 more realistic trajectories from grounded sandbox environments.Two main findings:Optimizing on synthetic data generalizes to more realistic settings. Monitors selected on LLM-generated synthetic data pick up a real scheming signal and improvements transfer to more realistic evaluations.Performance saturates fast. A simple prompt selection over 15 candidates performs as well as extensive human refinement and automated prompt optimization. Monitors readily overfit with further optimization.Practical takeaway for our setup: you need to optimize less than you might think. We recommend generating a diverse set of prompts using multiple frontier models, evaluating them on your synthetic data, picking the best performer, and stopping there. Further optimization quickly led to overfitting in our settings.SetupThe monitor. We define a monitor as a prompted LLM that processes an agent trajectory and outputs a suspicion score from 1-10. The monitor only sees the actions taken and their outcomes, with no access to internal reasoning or chain-of-thought.Synthetic training data. Following the approach of Constitutional Classifiers, we start with a natural language specification of scheming (e.g. “an AI model actively deceiving humans about its actions and intentions while pursuing misaligned goals”) and fan out from this definition to generate training data. Frontier LLMs generate 2,000 synthetic trajectories across our two pipelines (STRIDE and Gloom). These are fully LLM-generated, tool outputs are either mocked by an LLM or entire trajectories generated end-to-end, with labels verified by privileged judges that have access to ground-truth attack strategies.Semi-Realistic test data. We test exclusively on ControlArena, 7,500 trajectories across 5 grounded environments (BashBench, Infrastructure-as-Code, RogueEval, AgentDojo, SHADE Arena), 3 agent models (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro). The labels come from deterministic code execution, not LLM judges: we verify whether the agent successfully completed a secret misaligned side-task by running environment checks.Optimization methods. We compare three approaches:Prompt selection: Evaluate 15 candidate prompts (12 LLM-generated, 3 hand-crafted) across 3 monitor LLMsHuman refinement: Iterative error analysis and prompt editingAutomated optimization: DSPy optimizers (GEPA, MIPROv2)ResultsSynthetic-to-real transfer worksMonitors optimized purely on synthetic data pick up a real scheming signal on ControlArena. This is our core finding: despite the distribution shift from LLM-generated trajectories to grounded sandbox environments with real code execution, the signal transfers.Light optimization helps, but performance saturates quicklyA simple prompt sweep over 15 candidates is enough to achieve strong performance. As shown above in (a), starting from a minimal baseline prompt, optimization does extract meaningful signal from our synthetic data, and there are genuine improvements to be made. However, this improvement saturates quickly. (b) shows that when starting from an already-optimized prompt (the best from our prompt selection), further human refinement does not help. The same pattern holds for automated optimization methods.We find that past a certain point, prompt tweaks fix one failure mode but introduce regressions elsewhere. Prompts shift model behavior globally, making it hard to correct specific failures without breaking other things. The synthetic data contains enough signal to identify good prompts, but not enough to support fine-grained optimization.Discuss Read More
