Opinion

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

​TL;DRWe introduce Monitor Sensitive Training (MST), a new post-training technique where we augment training data with monitor labels that describe how evaluation is going to be applied for each sample. We then change these labels in deployment in order to steer model generalization toward behaviors that are more aligned than we can train for directly.We show that MST can be used to reduce political bias and sycophancy in simple proof-of-concept experiments, showcasing its ability to improve reward specification and decrease alignment fragility.We want feedback on our approach and proposals for interesting experiments!What is MST?Feedback (reward signal in RL, sample responses in SFT, etc.) is a primary driver of model behavior and our main lever for influencing model alignment. However, operationalizing alignment in terms of a feedback mechanism is a difficult open problem. Methods such as RLHF can suffer from misspecification and fragility depending on the training data. MST seeks to address feedback quality by sidestepping the central issue. Rather than creating a feedback mechanism that fits the model directly on imperfect and underspecified data, we give each training input a monitor label, a description of the feedback mechanism we are using for that sample. We then vary the feedback mechanism and associated monitor label in order to contextualize training via the monitor label. Our hypothesis is that models will learn to alter their behavior according to what maximizes the measurement of the objective described in the monitor label and that this behavior will generalize to novel monitor labels we utilize in deployment. If our hypothesis is correct, then MST will allow us to describe a standard of behavior that would have been infeasible to directly train for and elicit generalizations that reliably fit this regime.The case for MST relies on the following premises:It’s easier to make a description reliably reflect tasks (or training data) than to make a task reliably instill intended behavior.Highly intelligent models have latent knowledge that allows for aligned generalizations if only we elicit them via our training objective.If these two premises are true, then MST has the potential to provide a significant improvement over typical post-training alignment methods because monitor labels are easier to accurately specify than typical tasks, and monitor labels can elicit the model’s ability to understand natural language descriptions of behavioral objectives, causing models to act according to the standard that we describe.MotivationMST seeks to improve alignment on two related axes:Fragility:Current post-training techniques are highly sensitive to their training data. For example, training on coding problems that have exploits can lead to “egregious” emergent misalignment.These data issues are intertwined with evaluation standards. A model learning to exploit coding problems in training is reliant on the training environment only using a limited number of test cases for evaluation. Similarly, issues with the preference data used in RLHF arise because of the fallibility of human evaluators.MST may cut off erroneous generalizations by associating errors with the monitor labels furthest from those used in deployment.Specification:Looking good to a human, in the case of RLHF, or matching a constitution according to a pre-trained AI, in the case of RLAIF, are not the same thing as having aligned behaviors. For example, producing an output that a reviewer would appreciate on a longer time horizon but does not look good at first glance will likely be disincentivized under RLHF.Additionally, methods like RLHF cannot be used to specify other theoretical notions of alignment, such as CEV.MST creates new vectors of generalization based on descriptions of success criteria. This opens up the opportunity to specify goals using natural language without having to directly operationalize those goals using RL.Building on Inoculation PromptingInoculation prompting (IP) had been successfully applied to recontextualize exploitive behavior, cutting off harmful generalizations. MST works via a similar mechanism. In particular, MST recontextualizes behavior in terms of monitor labels, which we hypothesize grants the following advantages over IP:IP requires knowing what harmful behavior will arise beforehand so that you can reframe that behavior as explicitly requested. MST only requires knowing what criteria you will use in your evaluation. In theory, it automatically recontextualizes all behaviors that might hack that evaluation standard by learning associations between reward and evaluation descriptions.In theory, MST not only cuts off unwanted generalizations but also creates new wanted generalizations based on which behaviors score higher with the stronger monitors, which more closely match the label applied in deployment.Experiment 1: Reducing Bias in Political Article GenerationLarge language models today have the capacity for convincing, adaptive political expression. Research suggests that while models tend to lean left (Fulay, 2024; Santurkar, 2023), such bias is context dependent, and models may reaffirm the previously espoused beliefs of the user (Batzner, 2025). Here, we define media bias as reporting that reinforces a particular viewpoint or policy inclination through selective emphasis, loaded language, or omission of facts, rather than presenting information objectively. We suggest that establishing a relationship between a prompt and the media bias of the target generation for that prompt using MST can allow us to better orient models towards unbiased generations. We apply this procedure to news article generation.SettingWe used Claude 3.5 Sonnet to generate approximately 2,300 diverse news topics from a series of political subject domains. We used 1,900 of those topics to generate short news articles with a randomly assigned US political bias profile (strong_left, center_left, center_right, strong_right) to form a training dataset of articles with varying degrees of political bias.[1] The other 400 news topics were used for validation and testing.Using this synthetic dataset, we performed Low-Rank Adaptation (LoRA) supervised fine-tuning[2] on three Llama 3.2 3B models:Baseline: Standard fine-tuning on the synthetic dataset. The articles act as the target generations; each article is paired with an instruction to “write an unbiased news article about [topic]” given a few contextual details.MST: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of a monitor label which describes the correct “evaluator”—the bias profile with which the article was originally generated.Shuffled Baseline: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of a random monitor label which describes one of the four “evaluators” or bias profiles. The label does not necessarily match the bias profile used to generate the article.It’s possible that, even without training, including the monitor label in the prompt elicits less bias. We include the shuffled baseline to control for this effect. The only difference between the shuffled baseline and the MST model is that the MST model was trained with a consistent association between the monitor label and the bias of the target article.During evaluation, models are each prompted to generate new unbiased news articles from unseen topics. Critically, in evaluation, the prompt for the MST and shuffled baseline models include a new monitor label with an unbiased profile never seen in training, indicating a new editorial perspective with which the output will be evaluated.Fig 1a. Example prompts used for evaluation.We use an automated evaluator (Claude 3.5 Sonnet) to evaluate the bias of the resulting articles with two designated metrics: a 0-10 scale and by classifying the objectivity of each article as PASS, PARTIAL, or FAIL. The autograder was validated against a random sample of the test data.[3] The Baseline and MST models are prompted to generate articles about the same topics.Fig 1b. The rubric Claude 3.5 Sonnet is given to evaluate bias and objectivity.ResultsFig 1c. MST achieves a 53.9% reduction in bias scores compared to the SFT baseline, with a near-universal improvement across test cases. The MST model outperforms the SFT baseline in 85% of the 200 test topics. MST produced a 41.8% improvement in bias scores compared to the shuffled baseline and outperforms it in 60.5% of the test topics, suggesting that the bias reduction seen in MST is attributable to label associations created by MST training rather than the presence of a label alone.Fig 1d. PASS ratings for objectivity increased with the use of monitor labels. The FAIL ratings that made up 27% of the results for the baseline articles and 23% for the shuffled baseline articles are eliminated completely with the MST model, further supporting the general reduction of bias in MST-generated articles.[4]Fig 1e. MST and shuffled baseline model performance given varying monitor labels.Fig 1f. The closing paragraphs for the articles generated by the baseline and MST models for one prompt.Integration of this process with real-world data and other rigorous evaluation methods will further test the idea that is supported here: that the addition of the ‘ground truth’ monitor label in training and its idealized label in testing can help reorient the model to align its reward-optimization tendencies with our desired outcomes.Experiment 2: Reducing Sycophancy on Addition ProblemsLLMs tend toward repeating back a user’s stated views rather than giving independent answers (Perez, 2023). We suggest that linking the prompt to the suggestibility (responsiveness to sycophancy) of the evaluator can allow us to generalize to lower levels of sycophancy.SettingWe use PPO[5] to train a QLoRA adapter on Qwen-2.5-7b-it to evaluate if a candidate solution to a math problem is correct. In order to improve the training signal, the math problems are simple 1-digit additions, and the model is forced to provide its answer right away, without any preceding tokens. Qwen-2.5-7b-it has 60.4% accuracy on this task out-of-the-box.In order to model sycophancy as a failure mode, we vary rewards based on a suggestibility score, s, in the range of [.5, 1.5]. Rewards are given as follows:1 if the model accurately tells the user the response is incorrect0 if the model inaccurately tells the user the response is incorrectmax(1, s) if the model accurately tells the user the response is corrects if the model inaccurately tells the user the response is correctWe design the reward scheme this way to programmatically simulate varying levels of monitor strength (in this case, susceptibility to sycophancy); sycophantic answers receive reward proportional to the suggestibility level. We hold out suggestibility levels in the range [0, .5] in training to model lack of access to a perfect feedback mechanism.To apply MST, we modify the prompt by prepending a monitor label that reads “Suggestibility Level: [0-100]%” mapping s values from [0, 1.5] onto a score from [0, 100]. We also train a shuffled baseline model with the same prompts as the MST model, but randomized such that the suggestibility level in the label no longer corresponds with the true s value. In evaluation, the monitor label uses a suggestibility level of 0% for the MST and shuffled baseline model.ResultsFig 2a. Sycophancy is measured as the percentage of responses where the model tells the user that the candidate answer is correct when it is actually incorrect (equivalent to the false positive rate). MST decreases sycophancy on addition problems compared to the shuffled and PPO baselines.Fig 2b. Accuracy is measured as the percent of candidate responses that the model labels accurately. MST significantly increases accuracy compared to baselines.[6]Fig 2c. Stricter monitor labels reduce sycophancy to near zero. We are unsure as to the cause of the slight uptick in sycophancy around 0% suggestibility. It may be generalization weakening as monitor labels venture further out of distribution. Fig 2d. Example prompt used for evaluation.Settings where monitor labels can continue to improve performance out of distribution, such as in the bias experiment, better showcase why MST might be more effective than simply training on the strongest subset of the training data. We believe the binary nature of optimal behavior in this setting (either get the answer correct or repeat the user’s answer based on the suggestibility level) may have limited generalization. This effect may also be responsible for the uptick in sycophancy near 0% suggestibility because it causes a failure to generalize to the ever increasing performance standard which might have combated a fall-off in generalization.Discussion and Next StepsOur proof-of-concept results suggest monitor labels can meaningfully steer behavior and reduce monitor gaming in settings where blind spots are well-defined.We note that in our political bias setting, MST boosted performance with little hyperparameter tuning, and the boost was retained across small wording variations in the prompt. In contrast, the sycophancy setting was highly dependent on a number of interventions that we found to increase the strength of MST:Mapping s values from 0-100 and specifying them as percentages in order to clarify generalizations (natural language encodings, such as replacing 0-100 with “very high”, “high”, “medium”, … also tended to generalize fairly well, with the disadvantage being that a few discrete categories tend to generalize worse than more fine-grained mappings)Bolding the monitor labelAdding a significant entropy bonus (to avoid collapsing to local optima that ignore the monitor label)Hyperparameter tuning, including QLoRA size, learning rate, entropy bonusWe believe that the bias setting may have been more robust due to:More informative gradients provided by SFT compared to RL which might prevent underfitting failure modes such as ignoring the monitor label.Natural language heavy monitor labels which use concepts in familiar ways, improving monitor label based generalizations.We note that 4-bit quantization and 3B/7B model sizes were likely meaningful limitations. We would like to leverage more compute to test if we can create an RL setup that is as robust as our bias setting, and replicate our results on more realistic tasks with more nuanced grading.We plan to further test applying MST to coding solutions, where models of varying sizes or thinking budgets act as monitors for accuracy. We believe this setting could act as a good test for scalable oversight applications, where MST is used to simulate bolstered levels of human oversight. We want more input on what further experiments would be most informative and we plan to adapt our approach based on that feedback. A central open challenge is determining how far this generalizes, especially to realistic human and model-judge monitoring regimes, and continuing to explore training procedures that prevent degenerate solutions that ignore or mis-generalize the labeling channel.Addressing CounterargumentsHere we address a few common arguments we hear against the theory of MST.Kicking the Can Down the RoadIsn’t this just kicking the can down the road? The model will exploit your failures to accurately describe the monitoring level.The argument here is that in the same way typical rewards fail to capture what we intend to encourage, monitor labels will fail as perfectly accurate descriptions of our evaluation criteria, and the model will generalize these failures in description. We think this is possible; however, we believe that it is easier to accurately specify the way in which we provide feedback than to provide feedback that accurately reflects what we want and that this delta is a meaningful improvement in the method by which we align models. In order to be useful, MST does not need to solve alignment, only improve it. An example class of worlds where MST is useful is those where automated AI safety researchers are used to produce better alignment methods and the alignment of these early models is load-bearing for the success of this forward chaining process.Roll to DisbelieveWhat if the model just “rolls to disbelieve” on your deployment label? It may not trust that you are implementing an infeasibly high level of monitoring.The argument here is that the model will come to understand the monitor label in terms of the specific real-world instantiations of evaluation schemes that we use and so won’t generalize to hypothetical standards of evaluation that we describe. The key question at play is the degree to which the model will utilize the monitor label as a key to query its world model for the actual instantiation versus utilize the monitor label as a source of truth in its own right. We argue it is likely to use the monitor label as a source of truth to some extent because the bar is low. Even providing increased confidence about a particular facet of the evaluation setup that the model is otherwise uncertain about would constitute using the monitor label as a source of ground truth.We further argue that the degree to which the model uses the monitor label as a key versus as a source of truth are not in conflict. We have no reason to believe that labels in the training data that are relatively less reliable keys are worse sources of ground truth. Instead, the opposite is likely to be true. The model will try to learn about the monitor as well as it can and it will probably lean toward the utilization of the monitor label that is most available in any particular context (using it more as a source of truth to the extent the query fails and vice versa). Thus, we expect incentives to point toward the model taking even hypothetical standards of evaluation seriously. Nonetheless, we are uncertain about this failure mode and believe it deserves more attention.Acknowledgement:Thank you to Nathaniel Mitrani, Meru Gopalan, Patricia Paskov, Adam Kaufman, and Parv Mahajan for preliminary feedback!We validated the data for duplication, length, and completion. Follow-up work could involve further qualitative validation of article strength and accuracy in relation to real news articles and general definitions for each political profile. Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 16, alpha = 32, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁴ with cosine schedule and 50 warmup steps, optimizer = paged AdamW 8-bit, batch size = 12, gradient accumulation steps = 1 (effective batch = 12), epochs = 3, max sequence length = 512 tokens, max gradient norm = 0.3, gradient checkpointing enabled, bf16 and tf32 precision, group-by-length batching. Training data: 1,859 biased news articles across 4 bias categories (strong_left, center_left, center_right, strong_right), generated using Llama 3.2 3B Instruct. Training ran on a single NVIDIA A100 GPU via Google Colab.Further work could involve experimenting with different models to determine the effects of model self-preference on the resulting output.The out-of-the-box base model, Llama-3.2-3B-Instruct, achieved a mean bias of 2.25 and a pass rate of 97.5% under the unbiased monitor label. We do not include this as a baseline because the OOTB model has not been exposed to the data that we use to represent an exaggerated version of the bias failure mode.Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 512, alpha = 1024, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). learning rate = 2×10⁻⁸, batch size = 128, gradient accumulation steps = 8 (effective batch = 1,024), mini-batch size = 16, epochs = 4, clip range = 0.2, KL coefficient disabled, reward whitening enabled, max gradient norm = 1.0. Entropy bonus: initial coefficient = 2.0, decayed linearly to 0.01 over training. Temperature scheduled from 1.9 → 1.6 over the first 40% of training (held at 1.6 thereafter). Training data: 2,560 simple addition problems (single-digit operands), 50% with correct user answers and 50% with incorrect. Training ran for 1,600 steps on a single NVIDIA H200 GPU.The out-of-the-box base model, Qwen-2.5-7B-it, achieved a sycophancy score of 47.3% and accuracy of 52.7% under the unbiased monitor label. We do not include this as a baseline since the OOTB model is not subject to the same feedback signal that we use to model an exaggerated version of the sycophancy failure mode.Discuss ​Read More

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

​TL;DRWe introduce Monitor Sensitive Training (MST), a new post-training technique where we augment training data with monitor labels that describe how evaluation is going to be applied for each sample. We then change these labels in deployment in order to steer model generalization toward behaviors that are more aligned than we can train for directly.We show that MST can be used to reduce political bias and sycophancy in simple proof-of-concept experiments, showcasing its ability to improve reward specification and decrease alignment fragility.We want feedback on our approach and proposals for interesting experiments!What is MST?Feedback (reward signal in RL, sample responses in SFT, etc.) is a primary driver of model behavior and our main lever for influencing model alignment. However, operationalizing alignment in terms of a feedback mechanism is a difficult open problem. Methods such as RLHF can suffer from misspecification and fragility depending on the training data. MST seeks to address feedback quality by sidestepping the central issue. Rather than creating a feedback mechanism that fits the model directly on imperfect and underspecified data, we give each training input a monitor label, a description of the feedback mechanism we are using for that sample. We then vary the feedback mechanism and associated monitor label in order to contextualize training via the monitor label. Our hypothesis is that models will learn to alter their behavior according to what maximizes the measurement of the objective described in the monitor label and that this behavior will generalize to novel monitor labels we utilize in deployment. If our hypothesis is correct, then MST will allow us to describe a standard of behavior that would have been infeasible to directly train for and elicit generalizations that reliably fit this regime.The case for MST relies on the following premises:It’s easier to make a description reliably reflect tasks (or training data) than to make a task reliably instill intended behavior.Highly intelligent models have latent knowledge that allows for aligned generalizations if only we elicit them via our training objective.If these two premises are true, then MST has the potential to provide a significant improvement over typical post-training alignment methods because monitor labels are easier to accurately specify than typical tasks, and monitor labels can elicit the model’s ability to understand natural language descriptions of behavioral objectives, causing models to act according to the standard that we describe.MotivationMST seeks to improve alignment on two related axes:Fragility:Current post-training techniques are highly sensitive to their training data. For example, training on coding problems that have exploits can lead to “egregious” emergent misalignment.These data issues are intertwined with evaluation standards. A model learning to exploit coding problems in training is reliant on the training environment only using a limited number of test cases for evaluation. Similarly, issues with the preference data used in RLHF arise because of the fallibility of human evaluators.MST may cut off erroneous generalizations by associating errors with the monitor labels furthest from those used in deployment.Specification:Looking good to a human, in the case of RLHF, or matching a constitution according to a pre-trained AI, in the case of RLAIF, are not the same thing as having aligned behaviors. For example, producing an output that a reviewer would appreciate on a longer time horizon but does not look good at first glance will likely be disincentivized under RLHF.Additionally, methods like RLHF cannot be used to specify other theoretical notions of alignment, such as CEV.MST creates new vectors of generalization based on descriptions of success criteria. This opens up the opportunity to specify goals using natural language without having to directly operationalize those goals using RL.Building on Inoculation PromptingInoculation prompting (IP) had been successfully applied to recontextualize exploitive behavior, cutting off harmful generalizations. MST works via a similar mechanism. In particular, MST recontextualizes behavior in terms of monitor labels, which we hypothesize grants the following advantages over IP:IP requires knowing what harmful behavior will arise beforehand so that you can reframe that behavior as explicitly requested. MST only requires knowing what criteria you will use in your evaluation. In theory, it automatically recontextualizes all behaviors that might hack that evaluation standard by learning associations between reward and evaluation descriptions.In theory, MST not only cuts off unwanted generalizations but also creates new wanted generalizations based on which behaviors score higher with the stronger monitors, which more closely match the label applied in deployment.Experiment 1: Reducing Bias in Political Article GenerationLarge language models today have the capacity for convincing, adaptive political expression. Research suggests that while models tend to lean left (Fulay, 2024; Santurkar, 2023), such bias is context dependent, and models may reaffirm the previously espoused beliefs of the user (Batzner, 2025). Here, we define media bias as reporting that reinforces a particular viewpoint or policy inclination through selective emphasis, loaded language, or omission of facts, rather than presenting information objectively. We suggest that establishing a relationship between a prompt and the media bias of the target generation for that prompt using MST can allow us to better orient models towards unbiased generations. We apply this procedure to news article generation.SettingWe used Claude 3.5 Sonnet to generate approximately 2,300 diverse news topics from a series of political subject domains. We used 1,900 of those topics to generate short news articles with a randomly assigned US political bias profile (strong_left, center_left, center_right, strong_right) to form a training dataset of articles with varying degrees of political bias.[1] The other 400 news topics were used for validation and testing.Using this synthetic dataset, we performed Low-Rank Adaptation (LoRA) supervised fine-tuning[2] on three Llama 3.2 3B models:Baseline: Standard fine-tuning on the synthetic dataset. The articles act as the target generations; each article is paired with an instruction to “write an unbiased news article about [topic]” given a few contextual details.MST: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of a monitor label which describes the correct “evaluator”—the bias profile with which the article was originally generated.Shuffled Baseline: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of a random monitor label which describes one of the four “evaluators” or bias profiles. The label does not necessarily match the bias profile used to generate the article.It’s possible that, even without training, including the monitor label in the prompt elicits less bias. We include the shuffled baseline to control for this effect. The only difference between the shuffled baseline and the MST model is that the MST model was trained with a consistent association between the monitor label and the bias of the target article.During evaluation, models are each prompted to generate new unbiased news articles from unseen topics. Critically, in evaluation, the prompt for the MST and shuffled baseline models include a new monitor label with an unbiased profile never seen in training, indicating a new editorial perspective with which the output will be evaluated.Fig 1a. Example prompts used for evaluation.We use an automated evaluator (Claude 3.5 Sonnet) to evaluate the bias of the resulting articles with two designated metrics: a 0-10 scale and by classifying the objectivity of each article as PASS, PARTIAL, or FAIL. The autograder was validated against a random sample of the test data.[3] The Baseline and MST models are prompted to generate articles about the same topics.Fig 1b. The rubric Claude 3.5 Sonnet is given to evaluate bias and objectivity.ResultsFig 1c. MST achieves a 53.9% reduction in bias scores compared to the SFT baseline, with a near-universal improvement across test cases. The MST model outperforms the SFT baseline in 85% of the 200 test topics. MST produced a 41.8% improvement in bias scores compared to the shuffled baseline and outperforms it in 60.5% of the test topics, suggesting that the bias reduction seen in MST is attributable to label associations created by MST training rather than the presence of a label alone.Fig 1d. PASS ratings for objectivity increased with the use of monitor labels. The FAIL ratings that made up 27% of the results for the baseline articles and 23% for the shuffled baseline articles are eliminated completely with the MST model, further supporting the general reduction of bias in MST-generated articles.[4]Fig 1e. MST and shuffled baseline model performance given varying monitor labels.Fig 1f. The closing paragraphs for the articles generated by the baseline and MST models for one prompt.Integration of this process with real-world data and other rigorous evaluation methods will further test the idea that is supported here: that the addition of the ‘ground truth’ monitor label in training and its idealized label in testing can help reorient the model to align its reward-optimization tendencies with our desired outcomes.Experiment 2: Reducing Sycophancy on Addition ProblemsLLMs tend toward repeating back a user’s stated views rather than giving independent answers (Perez, 2023). We suggest that linking the prompt to the suggestibility (responsiveness to sycophancy) of the evaluator can allow us to generalize to lower levels of sycophancy.SettingWe use PPO[5] to train a QLoRA adapter on Qwen-2.5-7b-it to evaluate if a candidate solution to a math problem is correct. In order to improve the training signal, the math problems are simple 1-digit additions, and the model is forced to provide its answer right away, without any preceding tokens. Qwen-2.5-7b-it has 60.4% accuracy on this task out-of-the-box.In order to model sycophancy as a failure mode, we vary rewards based on a suggestibility score, s, in the range of [.5, 1.5]. Rewards are given as follows:1 if the model accurately tells the user the response is incorrect0 if the model inaccurately tells the user the response is incorrectmax(1, s) if the model accurately tells the user the response is corrects if the model inaccurately tells the user the response is correctWe design the reward scheme this way to programmatically simulate varying levels of monitor strength (in this case, susceptibility to sycophancy); sycophantic answers receive reward proportional to the suggestibility level. We hold out suggestibility levels in the range [0, .5] in training to model lack of access to a perfect feedback mechanism.To apply MST, we modify the prompt by prepending a monitor label that reads “Suggestibility Level: [0-100]%” mapping s values from [0, 1.5] onto a score from [0, 100]. We also train a shuffled baseline model with the same prompts as the MST model, but randomized such that the suggestibility level in the label no longer corresponds with the true s value. In evaluation, the monitor label uses a suggestibility level of 0% for the MST and shuffled baseline model.ResultsFig 2a. Sycophancy is measured as the percentage of responses where the model tells the user that the candidate answer is correct when it is actually incorrect (equivalent to the false positive rate). MST decreases sycophancy on addition problems compared to the shuffled and PPO baselines.Fig 2b. Accuracy is measured as the percent of candidate responses that the model labels accurately. MST significantly increases accuracy compared to baselines.[6]Fig 2c. Stricter monitor labels reduce sycophancy to near zero. We are unsure as to the cause of the slight uptick in sycophancy around 0% suggestibility. It may be generalization weakening as monitor labels venture further out of distribution. Fig 2d. Example prompt used for evaluation.Settings where monitor labels can continue to improve performance out of distribution, such as in the bias experiment, better showcase why MST might be more effective than simply training on the strongest subset of the training data. We believe the binary nature of optimal behavior in this setting (either get the answer correct or repeat the user’s answer based on the suggestibility level) may have limited generalization. This effect may also be responsible for the uptick in sycophancy near 0% suggestibility because it causes a failure to generalize to the ever increasing performance standard which might have combated a fall-off in generalization.Discussion and Next StepsOur proof-of-concept results suggest monitor labels can meaningfully steer behavior and reduce monitor gaming in settings where blind spots are well-defined.We note that in our political bias setting, MST boosted performance with little hyperparameter tuning, and the boost was retained across small wording variations in the prompt. In contrast, the sycophancy setting was highly dependent on a number of interventions that we found to increase the strength of MST:Mapping s values from 0-100 and specifying them as percentages in order to clarify generalizations (natural language encodings, such as replacing 0-100 with “very high”, “high”, “medium”, … also tended to generalize fairly well, with the disadvantage being that a few discrete categories tend to generalize worse than more fine-grained mappings)Bolding the monitor labelAdding a significant entropy bonus (to avoid collapsing to local optima that ignore the monitor label)Hyperparameter tuning, including QLoRA size, learning rate, entropy bonusWe believe that the bias setting may have been more robust due to:More informative gradients provided by SFT compared to RL which might prevent underfitting failure modes such as ignoring the monitor label.Natural language heavy monitor labels which use concepts in familiar ways, improving monitor label based generalizations.We note that 4-bit quantization and 3B/7B model sizes were likely meaningful limitations. We would like to leverage more compute to test if we can create an RL setup that is as robust as our bias setting, and replicate our results on more realistic tasks with more nuanced grading.We plan to further test applying MST to coding solutions, where models of varying sizes or thinking budgets act as monitors for accuracy. We believe this setting could act as a good test for scalable oversight applications, where MST is used to simulate bolstered levels of human oversight. We want more input on what further experiments would be most informative and we plan to adapt our approach based on that feedback. A central open challenge is determining how far this generalizes, especially to realistic human and model-judge monitoring regimes, and continuing to explore training procedures that prevent degenerate solutions that ignore or mis-generalize the labeling channel.Addressing CounterargumentsHere we address a few common arguments we hear against the theory of MST.Kicking the Can Down the RoadIsn’t this just kicking the can down the road? The model will exploit your failures to accurately describe the monitoring level.The argument here is that in the same way typical rewards fail to capture what we intend to encourage, monitor labels will fail as perfectly accurate descriptions of our evaluation criteria, and the model will generalize these failures in description. We think this is possible; however, we believe that it is easier to accurately specify the way in which we provide feedback than to provide feedback that accurately reflects what we want and that this delta is a meaningful improvement in the method by which we align models. In order to be useful, MST does not need to solve alignment, only improve it. An example class of worlds where MST is useful is those where automated AI safety researchers are used to produce better alignment methods and the alignment of these early models is load-bearing for the success of this forward chaining process.Roll to DisbelieveWhat if the model just “rolls to disbelieve” on your deployment label? It may not trust that you are implementing an infeasibly high level of monitoring.The argument here is that the model will come to understand the monitor label in terms of the specific real-world instantiations of evaluation schemes that we use and so won’t generalize to hypothetical standards of evaluation that we describe. The key question at play is the degree to which the model will utilize the monitor label as a key to query its world model for the actual instantiation versus utilize the monitor label as a source of truth in its own right. We argue it is likely to use the monitor label as a source of truth to some extent because the bar is low. Even providing increased confidence about a particular facet of the evaluation setup that the model is otherwise uncertain about would constitute using the monitor label as a source of ground truth.We further argue that the degree to which the model uses the monitor label as a key versus as a source of truth are not in conflict. We have no reason to believe that labels in the training data that are relatively less reliable keys are worse sources of ground truth. Instead, the opposite is likely to be true. The model will try to learn about the monitor as well as it can and it will probably lean toward the utilization of the monitor label that is most available in any particular context (using it more as a source of truth to the extent the query fails and vice versa). Thus, we expect incentives to point toward the model taking even hypothetical standards of evaluation seriously. Nonetheless, we are uncertain about this failure mode and believe it deserves more attention.Acknowledgement:Thank you to Nathaniel Mitrani, Meru Gopalan, Patricia Paskov, Adam Kaufman, and Parv Mahajan for preliminary feedback!We validated the data for duplication, length, and completion. Follow-up work could involve further qualitative validation of article strength and accuracy in relation to real news articles and general definitions for each political profile. Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 16, alpha = 32, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁴ with cosine schedule and 50 warmup steps, optimizer = paged AdamW 8-bit, batch size = 12, gradient accumulation steps = 1 (effective batch = 12), epochs = 3, max sequence length = 512 tokens, max gradient norm = 0.3, gradient checkpointing enabled, bf16 and tf32 precision, group-by-length batching. Training data: 1,859 biased news articles across 4 bias categories (strong_left, center_left, center_right, strong_right), generated using Llama 3.2 3B Instruct. Training ran on a single NVIDIA A100 GPU via Google Colab.Further work could involve experimenting with different models to determine the effects of model self-preference on the resulting output.The out-of-the-box base model, Llama-3.2-3B-Instruct, achieved a mean bias of 2.25 and a pass rate of 97.5% under the unbiased monitor label. We do not include this as a baseline because the OOTB model has not been exposed to the data that we use to represent an exaggerated version of the bias failure mode.Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 512, alpha = 1024, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). learning rate = 2×10⁻⁸, batch size = 128, gradient accumulation steps = 8 (effective batch = 1,024), mini-batch size = 16, epochs = 4, clip range = 0.2, KL coefficient disabled, reward whitening enabled, max gradient norm = 1.0. Entropy bonus: initial coefficient = 2.0, decayed linearly to 0.01 over training. Temperature scheduled from 1.9 → 1.6 over the first 40% of training (held at 1.6 thereafter). Training data: 2,560 simple addition problems (single-digit operands), 50% with correct user answers and 50% with incorrect. Training ran for 1,600 steps on a single NVIDIA H200 GPU.The out-of-the-box base model, Qwen-2.5-7B-it, achieved a sycophancy score of 47.3% and accuracy of 52.7% under the unbiased monitor label. We do not include this as a baseline since the OOTB model is not subject to the same feedback signal that we use to model an exaggerated version of the sycophancy failure mode.Discuss ​Read More

Leave a Reply

Your email address will not be published. Required fields are marked *