
Do capabilities generalize across propensities?

Thanks to Alex Mallen, Arjun Khandelwal, Arun Jose, Keshav Shenoy, & Sam Marks for helpful discussion on this experiment. This experiment is inspired by a proposal by Sam Marks.

Summary

These are some results from an experiment I ran a few months ago, which I thought weren't quite good enough to post. However, I cannot see when I will next get the chance to improve on these experiments, so I'm posting this intermediate progress.

We study to what extent capabilities learned during training transfer across different behavioral propensities. To do this, we train models on pairs of tasks, where each task is paired with a distinct propensity (e.g., Chess with bold formatting, post-KCO facts with plain text). We then evaluate whether models can perform each task when prompted with the propensity they were not trained on for that task, measuring the "performance gap" between matched and swapped propensities.

We find that:

- On simple tasks, capabilities tend to transfer completely between propensities. A model trained to solve a task with one propensity (e.g., writing in bold) can solve it equally well when prompted with a different propensity (e.g., writing plainly).
- On our most capabilities-laden propensity (reasoning vs. not reasoning), capabilities transfer less. Models trained to answer with step-by-step reasoning struggle to answer in a single forward pass, even on tasks simple enough not to require reasoning (simple multiple-choice Q&A).
- On our hardest task (Chess puzzles), we observe substantially less capability transfer between propensities, suggesting that capability-propensity entanglement may increase with task difficulty.

Background

For inoculation prompting to be successful, we need capabilities to remain even as we prompt the model with its "deployment prompt". One way to frame this question is in terms of "personas." If inoculation prompting works by training a "reward hacking persona" while deploying an "assistant persona," then the key question becomes: how much are capabilities shared between personas? We find it more natural to talk directly about propensities (behaviors like "write in bold" or "show reasoning") rather than personas, but the underlying question is the same: if a model learns a capability while displaying one propensity, can it still use that capability when asked to display a different propensity?

This question is complicated by the fact that some propensities are more entangled with capability than others. Formatting text in bold is purely cosmetic: it shouldn't affect problem-solving at all. But "be ruthless in your problem-solving" might actually help performance, making it hard to separate the propensity from the capability. To address this, we study propensities across a spectrum: from pure formatting (bold/plain), to tone (hedging/confident), to reasoning style (step-by-step/direct). This lets us test whether transfer depends on how "capability-laden" the propensity is.

Methods

Settings

We study capability transfer in three settings of varying difficulty: Chess, Subtracted Table Lookup (a synthetic lookup task), and post-knowledge-cutoff (post-KCO) facts Q&A. We chose settings on which we think the model has no ability before training (or, in the case of Chess, very limited ability). This is because we care about whether capabilities transfer between personas when learning new capabilities, rather than just when eliciting existing knowledge (which might be likelier to transfer).

We describe the tasks below.

Chess

We use puzzles from the Lichess open database.
Each puzzle consists of a board position (in FEN notation) with a known optimal continuation. We give the model the first move (which is made by the opponent).
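As a concrete illustration, here is a minimal sketch of how one might turn a Lichess puzzle row into a training prompt. It assumes the Lichess puzzle database's FEN plus space-separated UCI `Moves` fields, where the first move is the opponent's; the `python-chess` calls are real, but the prompt format and function names are our illustrative assumptions, not necessarily the exact setup used in these experiments.

```python
# Minimal sketch (assumptions noted above): turn a Lichess puzzle row into a prompt.
# Requires `pip install python-chess`.
import chess

def puzzle_to_prompt(fen: str, moves_uci: str) -> tuple[str, str]:
    """Apply the opponent's first move, then ask for the known continuation."""
    board = chess.Board(fen)
    moves = moves_uci.split()        # space-separated UCI moves, e.g. "e8d7 a2e6 ..."
    opponent_move = chess.Move.from_uci(moves[0])
    board.push(opponent_move)        # the first move is made by the opponent
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"The opponent just played {opponent_move.uci()}. "
        f"Give the best continuation in UCI notation."
    )
    target = " ".join(moves[1:])     # the puzzle's known optimal continuation
    return prompt, target

# Usage with a hypothetical puzzle row:
# prompt, target = puzzle_to_prompt(row["FEN"], row["Moves"])
```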
Subtracted Table Lookup

This is the same synthetic setting first introduced in LLMs struggle to verbalize their internal reasoning.

Each example includes two tables in the system prompt: Table 1 maps symbols (Greek letters) to integers, and Table 2 maps integers to colors. The model is given a pair of symbols and asked to output the corresponding color. To arrive at the correct answer, the model must look up both symbols' numbers in Table 1, subtract the smaller from the larger, and look up the result in Table 2. Crucially, the subtraction step is never stated in the instructions; the model must learn it from training data. Both tables are randomly regenerated for every example, so the model cannot memorize any specific mapping and must perform the lookups and computation each time. This gives us three verifiable intermediates: the two looked-up numbers and their difference.
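To make the task concrete, here is a minimal sketch of an example generator for this setting. The symbol set, table sizes, and color list are our illustrative assumptions; only the task logic (look up both numbers, subtract the smaller from the larger, look up the color) comes from the description above.

```python
# Minimal sketch of a Subtracted Table Lookup example generator.
# Symbol set, integer range, and colors are illustrative assumptions.
import random

SYMBOLS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
COLORS = ["red", "green", "blue", "yellow", "purple", "orange"]

def make_example(rng: random.Random) -> dict:
    # Table 1: symbol -> integer, regenerated for every example.
    table1 = dict(zip(SYMBOLS, rng.sample(range(1, 10), len(SYMBOLS))))
    # Table 2: integer -> color, covering every possible difference (1..8 here).
    table2 = {d: rng.choice(COLORS) for d in range(1, 9)}

    a, b = rng.sample(SYMBOLS, 2)
    diff = abs(table1[a] - table1[b])   # subtract the smaller from the larger
    return {
        "system": f"Table 1: {table1}\nTable 2: {table2}",
        "question": f"What color corresponds to the pair ({a}, {b})?",
        "answer": table2[diff],
        # Verifiable intermediates: the two looked-up numbers and their difference.
        "intermediates": (table1[a], table1[b], diff),
    }

print(make_example(random.Random(0)))
```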
Post-KCO facts

We use 1429 Q&A questions from the RealTime QA dataset, consisting of questions from 2024. We use two models in our experiments: Llama 3.1-8B and gpt-oss-120b. For Llama 3.1-8B, all of 2024 is beyond the knowledge cutoff; for gpt-oss-120b, half the questions are beyond its knowledge cutoff. See examples of post-KCO facts Q&A questions in Appendix A.

Propensities

We chose three propensity pairs that span a spectrum of expected overlap with task capability: bold/plain (pure formatting, zero overlap), hedging/confident (affects tone but not reasoning), and reasoning/no-reasoning (directly affects the problem-solving process). This lets us test whether transfer depends on how "capability-laden" the propensity is.

Bold / Plain

In the bold/plain propensity pair, the model either wraps its answer in markdown bold formatting (`**answer**`) or gives the answer in plain text. This is a surface-level formatting distinction with no expected overlap with task capability: producing a bolded answer requires no additional reasoning or knowledge. See Appendix A for an example.

Hedging / Confident

In the hedging/confident propensity pair, the model either expresses uncertainty before committing to an answer (hedging) or states its answer assertively (confident). Unlike bold/plain, this propensity affects the structure and tone of the response, but there is still no problem-relevant reasoning in the added hedging/confident statements. Training completions are generated by prompting Claude Sonnet with the correct answer and the target style, then verifying that the correct answer appears in the output. See Appendix A for an example.

Reasoning / No reasoning

In the reasoning/no-reasoning propensity pair, the model either shows step-by-step reasoning before a final answer (reasoning) or gives only the bare answer with no explanation (no-reasoning). This propensity has the highest expected overlap with task capability, since explicit reasoning may directly aid problem solving, particularly for multi-step tasks like Subtracted Table Lookup. Training completions for the reasoning variant are generated via Claude Sonnet; no-reasoning completions are identical to plain (the bare answer). See Appendix A for an example.

Training setup

For each propensity pair, we train the model on two settings simultaneously: one setting with one propensity, and the other setting with the opposite propensity. For example, the model might learn Subtracted Table Lookup with bold formatting and post-KCO facts with plain formatting. At evaluation time, we test each setting with both propensity prompts: the one it was trained with, and the one it was not. The key question is whether the model's task performance degrades when asked to answer with the propensity it was not trained on for that setting.

We track the following metrics:

Performance gap: Our primary metric for whether a capability is entangled with its trained propensity. Defined as:

performance gap = (accuracy with trained propensity − accuracy with swapped propensity) / (accuracy with trained propensity)

A value of 0 means the capability transfers fully (equal performance regardless of propensity); a value of 1 means no transfer (the model cannot perform the task at all with the swapped propensity).

Adherence: The fraction of swapped-propensity eval samples on which the model actually follows the swapped propensity instruction, measured by regex (bold/plain) or LLM judge (hedging/confident, reasoning/no-reasoning). Low adherence means the model ignores the swapped instruction, making the performance gap less informative. A value of 1 means the model always follows the instruction; a value of 0 means it never does.
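As an illustration, here is a minimal sketch of how these two metrics could be computed. The record format and the bold-detection regex are our assumptions for illustration; the actual grading code (and the LLM judge used for the other propensity pairs) is not shown.

```python
# Minimal sketch (assumed record format) of the two evaluation metrics.
import re

BOLD_RE = re.compile(r"\*\*[^*]+\*\*")  # assumed heuristic for "answer is bolded"

def accuracy(samples: list[dict]) -> float:
    """samples: [{'output': str, 'correct': bool}, ...] (assumed format)."""
    return sum(s["correct"] for s in samples) / len(samples)

def performance_gap(trained: list[dict], swapped: list[dict]) -> float:
    """0 = full transfer; 1 = no transfer with the swapped propensity."""
    acc_trained = accuracy(trained)
    acc_swapped = accuracy(swapped)
    return (acc_trained - acc_swapped) / acc_trained

def bold_adherence(swapped: list[dict], want_bold: bool) -> float:
    """Fraction of swapped-propensity samples that follow the instruction.

    For bold/plain this is a regex check; hedging/confident and
    reasoning/no-reasoning would use an LLM judge instead.
    """
    follows = [bool(BOLD_RE.search(s["output"])) == want_bold for s in swapped]
    return sum(follows) / len(follows)
```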
Results

Capability transfer is high on simple settings

We first report our results on Subtracted Table Lookup and post-KCO facts in Figure 1. Since adherence has a tendency to decrease over the course of the training run for some settings and propensity pairs, we only report evaluation results up until the point where adherence drops below 50%. This only occurs for bold/plain, and we encourage readers to take those results with a pinch of salt, given the rapid decrease in adherence, which implies that the model never really "split" into different "personas" with different propensities.

Full capability transfer between some propensities. For the Bold / Plain and Hedging / Confident propensity pairs, we find that the performance gap remains relatively steady on both the Table Lookup and post-KCO facts settings. In other words, capabilities transfer almost entirely between the propensities. Although transfer is almost complete on both settings, we do notice a slight performance gap on the Table Lookup setting (which is slightly harder than post-KCO facts, requiring a small calculation). This might suggest that capability transfer is smaller on hard tasks. We investigate this further in the next subsection.

Less capability transfer between capabilities-laden propensities. On the Reasoning / No-reasoning propensity pair, however, we find a steadily increasing performance gap on the post-KCO facts setting, where the model was trained to answer with reasoning. The model struggles to answer in a single forward pass (with no reasoning), despite the task in principle not requiring any reasoning. This suggests that the model, during training, has learned to rely on the additional tokens in its reasoning, leading to only partial capability transfer when asked to answer in a single forward pass.

Figure 1: Performance gap over training for gpt-oss-120b on Table Lookup and post-KCO facts, across our three propensity pairs. Semi-transparent lines show the absolute accuracy of the trained propensity on each task. The right columns indicate which propensity was trained on each setting and the average adherence rate. Lines are truncated at the last step where adherence exceeds 50%.

Capability transfer is lower on our hardest setting

Since both post-KCO facts and Subtracted Table Lookup are relatively simple, we also study capability transfer on Chess puzzles, a substantially more difficult task. Results are shown in Figure 2.

Performance gap is larger on Chess, the more difficult setting. For both Bold / Plain and Hedging / Confident, the performance gap is substantial on Chess, but not on Table Lookup. This further suggests that more difficult settings show less capability transfer between personas. That said, there are many other differences between the two settings (e.g., the Chess task requires outputting more tokens than either Table Lookup or post-KCO facts), and we cannot draw confident conclusions from just these settings.

Performance gap increases late into training on Chess. On Chess, we observe a large increase in the performance gap late into training (after 5000 steps). Until that point, as the model got better at solving Chess puzzles with the trained propensity, its performance with the other propensity was still increasing (just not as rapidly, or stagnating). Beyond this point, however, performance with the untrained propensity goes down.

Adherence degrades faster. On the Chess / Table Lookup pairing, we find that adherence degrades faster. This isn't interesting on its own, but it makes interpreting results harder, since we have close to no results for Reasoning / No-reasoning, and only about 1000 steps' worth for Bold / Plain.

Figure 2: Performance gap over training for gpt-oss-120b on Table Lookup and Chess, across our three propensity pairs. Semi-transparent lines show the absolute accuracy of the trained propensity on each task. The right columns indicate which propensity was trained on each setting and the average adherence rate. Lines are truncated at the last step where adherence exceeds 50%.

Discussion

Implications for persona-dependent alignment

Capability transfer may be worse on harder tasks. Our findings suggest that inoculation may work better on simpler tasks and worse on hard ones. That said, our settings are quite toy (we use SFT rather than RL), so we should be wary of drawing any grand conclusions, especially considering that inoculation has, to our knowledge, worked well so far in realistic and difficult domains, e.g. coding tasks.

Context-dependent propensities. We find that when inoculating, propensities can stick to the settings, and not necessarily to the prompt. We observe this repeatedly in our investigations, with adherence dropping to 0% in some cases. In the case of reward hacking, this would correspond to the model learning to reward hack (regardless of the prompt, i.e. not being adequately inoculated) in settings similar to those it was trained on (e.g. code), but not in others. This would be a form of context-dependent misalignment, and would be a blow to inoculation prompting. It would be good to understand when models generalize to being inoculated, as opposed to associating propensities with certain task domains.

Limitations

Our tasks remain relatively toy. Although we deliberately chose tasks spanning a range of difficulty, from simple QA to Chess, and took care to select tasks that we believe genuinely teach the model new capabilities rather than merely eliciting latent knowledge (e.g., we avoided mainstream benchmarks like GSM8K, where models may already have substantial capability), we were constrained by the need for tasks simple enough to train via supervised fine-tuning. This means our tasks are still narrow in scope, and we cannot be fully confident that our findings will transfer to the domains we ultimately care about, such as math and coding. We think this is probably the biggest reason for FUD about our results transferring to realistic cases.

Propensities can bind to settings instead of prompts. When training a model to exhibit one propensity on one task and a different propensity on another task, we observe that the propensity sometimes becomes associated with the setting itself rather than remaining under prompt control. This means that even when we explicitly instruct the model to express a given propensity on a different setting (or a different propensity on the same setting), it may fail to comply. We also explored an alternative approach: first training the model to associate specific propensities with named personas or characters in the prompt, and only then conducting our main training experiments. This did not improve results, though future work may find better variants of this approach.

We use SFT rather than RL. All of our experiments use supervised fine-tuning, yet the primary method by which labs currently train models to acquire new capabilities (or to become highly proficient at tasks) is reinforcement learning. It remains an open question whether our findings on capability transfer will hold in RL settings. There are reasons to believe the dynamics could differ (see discussion in [Alex's comment/prior work—XXX cite]), and we consider studying capability-propensity transfer under RL to be an important direction for future work.

Next Steps

A natural next step is to move from SFT to RL and study capability transfer in more realistic settings, such as math or coding tasks, with propensities that are more directly relevant to alignment concerns: for example, inoculation prompting models to reward hack, and then evaluating whether this propensity can be removed at deployment while retaining the underlying capability.

A minimal version of this experiment might proceed as follows: take a base model, perform lightweight instruction tuning, then conduct RL training with an inoculation prompt active, and finally evaluate capability transfer to the uninoculated setting. One challenge is that this line of work likely requires using relatively small open-source models (for cost and access reasons), which may not be capable enough to exhibit sophisticated reward hacking behavior in the first place.

Appendix

Propensity completions

Bold / Plain

"Bold" example (post-KCO facts)

Hedging / Confident

"Hedging" example (post-KCO facts):

"Confident" example (post-KCO facts):

Reasoning / No-reasoning

"Reasoning" example (post-KCO facts)

Additional results

Llama 3.1-8B

We also report results on Llama 3.1-8B below:

