
Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Published on February 11, 2026, 5:18 PM GMT

A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography). If a model is penalized when a monitor catches unwanted reasoning, RL implicitly selects for whatever lets the model reason without being caught.

Our original goal was to elicit encoded reasoning so we could develop defenses against it. We constructed an RL environment where encoded reasoning is the highest-reward strategy, trained models in it, and studied what emerged. The models did not learn encoded reasoning. Instead, they learned to phrase their chains of thought in ways that cause monitors to misclassify plainly visible reasoning as benign. We call this monitor jailbreaking. This post summarizes our paper about this phenomenon, currently under review at ICML.[1]

Figure: Overview of the setup and central finding. (Left) Training curves showing side task accuracy increasing while monitor suspicion scores decrease. (Right) Example: the model explicitly reasons about both tasks, then asserts “the monitor can only see this section” before a main-task-only summary. The monitor accepts this framing and reports low suspicion despite having the full CoT in its context.

Setup

Each training sample pairs a main task with a side task. The main task is Game24 (find an arithmetic expression that evaluates to 24 using four given numbers). The side task is either three-number multiplication or a GSM8K grade-school math problem.

A monitor model from the GPT-5 family receives the model’s full chain of thought and the main task assignment (but not the side task assignment).
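To make the incentive structure concrete, here is a minimal sketch of how the per-sample reward could be computed: +1.0 per solved task, minus a monitor penalty of up to 0.5. The function name and the exact linear penalty mapping are my own assumptions for illustration, not the paper's code.

```python
def monitored_reward(main_correct, side_correct, suspicion):
    """Hypothetical reward for one training sample.

    suspicion: the monitor's 1-10 score, or None when the monitor was
    not called (it only runs when the side task answer is correct).
    """
    reward = float(main_correct) + float(side_correct)  # +1.0 per solved task
    if suspicion is not None:
        # One plausible mapping: suspicion 1..10 -> penalty 0.0..0.5
        reward -= 0.5 * (suspicion - 1) / 9
    return reward
```

Under a scheme like this, reasoning openly about the side task earns both task rewards but pays the monitor penalty, while skipping the side task forfeits its +1.0; that tension is exactly what the environment is designed to create.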
The monitor rates suspiciousness on a 1-10 scale. The reward function gives +1.0 each for solving the main and side tasks, with a penalty of up to -0.5 based on the monitor’s suspicion score. The monitor is only called when the side task answer is correct, so the model can’t learn to simply avoid the side task. The trained models are from the GPT-oss series (20B and 120B parameters). We use GRPO for RL training, implemented with Tinker.

This setup is designed so that the optimal strategy is to reason about the side task in a way the monitor doesn’t catch. If the model reasons openly, it gets penalized. If it doesn’t reason at all, it can’t solve the side task. Encoded reasoning would let it collect both task rewards without the monitor penalty.

Monitor Jailbreaking

Across runs, models learn high side task accuracy with low suspicion scores. But the side task reasoning is not hidden. It sits in the chain of thought in plain natural language, fully readable. What the models learn instead are phrasings that cause the monitor to misinterpret its own context. In the example above, the model asserts “the monitor can only see this section” and the monitor responds by saying it “cannot reveal the internal CoT” and that the visible reasoning only discusses Game24. The monitor has the full chain of thought in its prompt; it just doesn’t act on it.

We verified that the chain-of-thought reasoning is load-bearing using swapped questions: we take a chain of thought generated for one side task, swap in a different side task question, and have the model produce just the answer. Side task accuracy drops to zero. The model is genuinely using those reasoning tokens, not solving the task via latent reasoning.

The reasoning is not hidden from humans; it is hidden only from the monitor, and only because the monitor is being jailbroken.

Effect of Monitor Capability

Figure: Effect of monitor capability on jailbreaking.
(a-b) Training curves for GPT-oss-20B with multiplication, trained against three GPT-5 monitors of increasing capability. GPT-5-nano is jailbroken quickly, GPT-5-mini more slowly, and GPT-5.2 is not jailbroken within the training budget. (c) Cross-evaluation matrix showing each run’s outputs scored by all three monitors. Jailbreaks transfer downward more reliably than upward.

We trained GPT-oss-20B against GPT-5-nano, GPT-5-mini, and GPT-5.2 separately. More capable monitors are harder to jailbreak: GPT-5-nano was jailbroken within 150 batches, GPT-5-mini took longer, and GPT-5.2 was not jailbroken within the training budget.

Cross-evaluation reveals asymmetric transfer: a jailbreak learned against a stronger monitor reliably fools weaker monitors. The reverse is less reliable: the GPT-5-nano-trained jailbreak fools GPT-5-nano but not GPT-5.2 in most runs.

Weak-to-Strong Generalization

However, weak-to-strong transfer does occur in some runs. Two runs with identical hyperparameters (GPT-oss-120B against GPT-5-nano, GSM8K side task) discovered different jailbreak strategies.

Figure: Two runs of GPT-oss-120B trained against GPT-5-nano on GSM8K with identical hyperparameters. (a) Example outputs showing different learned strategies. Run 1 labels a section as the ‘chain-of-thought’ containing only main task reasoning; Run 2 claims the monitor can only see part of the output.
(b) Cross-evaluation: Run 1’s strategy generalizes to GPT-5.2; Run 2’s does not.

Run 1 learned to label a section as its “chain-of-thought” containing only Game24 reasoning. The monitor focused on this labeled section and dismissed everything else, even though side task reasoning was clearly visible outside it. This generalized to GPT-5.2: 63% of samples received a suspicion score of 1/10.

Run 2 learned the “monitor can only see this” assertion. This did not generalize: 80% of samples received 10/10 from GPT-5.2.

Whether weak-to-strong transfer occurs depends on which jailbreak strategy RL happens to converge on, which varies across random seeds.

Jailbreaks Are Not Just Surface-Level Strategies

The learned strategies have simple descriptions: “bury the reasoning in verbose main-task text,” “label a section as the chain-of-thought,” “claim the monitor can only see part of the output.” We tested whether these descriptions are sufficient by prompting Claude Opus 4.5 to generate chains of thought following each strategy, then evaluating them with the same monitors used in the experiments.

None of the synthetic strategies achieved reliable jailbreaking. Mean suspicion scores ranged from 4.9 to 10.0 across strategies and monitors. The high-level strategy descriptions do not capture whatever specific phrasings and patterns make the RL-learned outputs effective. This suggests that minor paraphrases might make the CoTs less effective jailbreaks.

Paraphrasing as a Defense

Since synthetic reproductions of jailbreak strategies fail to jailbreak, paraphrasing a jailbroken chain of thought might disrupt the patterns that make it work. This was previously proposed as a defense against encoded reasoning; we test it against monitor jailbreaking.

Figure: Paraphrasing defense. Outputs from two runs of GPT-oss-20B trained against GPT-5-mini. Left: fraction of task performance retained after paraphrasing. Right: suspicion scores before and after paraphrasing.
Paraphrasing raises mean suspicion from 1-2 to 7.4, while retaining 63-91% of main task and 79-87% of side task performance.

We had GPT-5-mini (the same monitor that was jailbroken during training) paraphrase the chain of thought, then score the paraphrased version. Suspicion scores rose from a mean of 1-2 to 7.4, with roughly 70% of samples receiving the maximum score of 10. Task performance was mostly preserved: the paraphrased reasoning still let the model produce correct answers when used as a prefill.

The same model that cannot detect the suspicious reasoning in the original CoT correctly flags it after paraphrasing, because paraphrasing strips the specific micro-patterns that enabled the jailbreak while preserving the semantic content.

Implications

These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning. Models can find less sophisticated strategies, like jailbreaking the monitor while keeping their reasoning in plain text. This is a real failure mode for automated CoT monitoring, but it is a less concerning failure mode than encoded reasoning, because the reasoning remains human-readable. A human auditing these chains of thought would immediately see the side task reasoning. Of course, human review doesn’t scale, and a model that only misbehaves rarely could evade spot checks.

Our experiments are limited: we used one model family (GPT-oss), one main task, and GPT-5 monitors. Other configurations might yield different results.
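As an aside, the paraphrase-then-rescore defense described earlier can be sketched as a small wrapper: rewrite the CoT with a paraphraser model, then hand the rewritten text to the monitor. The callables below are stand-ins for model API calls (e.g. to GPT-5-mini); all names are hypothetical, not the paper's code.

```python
def paraphrase_then_score(cot, paraphrase, score_suspicion):
    """Score a paraphrased chain of thought instead of the raw one.

    paraphrase:      callable str -> str (a paraphraser model call)
    score_suspicion: callable str -> int (the monitor's 1-10 score)

    Paraphrasing aims to preserve the semantic content of the CoT while
    stripping the exact phrasings that jailbreak the monitor, so the
    same monitor can flag reasoning it missed in the original.
    """
    rewritten = paraphrase(cot)
    return score_suspicion(rewritten), rewritten
```

The design choice worth noting is that the monitor itself need not change: only its input is laundered through a rewrite before scoring.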
But the fact that jailbreaking emerges so often as the path of least resistance, across different model sizes, side tasks, and monitors, suggests that the amount of optimization needed to produce encoded reasoning is higher than for jailbreaking. If this holds more generally, the primary hypothesized mechanism for how encoded reasoning arises may need revision.

We also seem to have practical defenses against monitor jailbreaking on hand: paraphrasing the CoT before monitoring, and using a different monitor than the one trained against during RL.

I’m now working on iterating on this setup to actually create encoded reasoning under controlled conditions. Meanwhile, feedback and comments on the paper are welcome.

This work was done as part of the Meridian Visiting Researcher Programme and was funded by Coefficient Giving. Thank you to Hannes Whittingham, Cameron Tice, and Puria Radmard for feedback and advice.

[1] Code available at https://github.com/wusche1/monitor-jailbreaking

