LLMs struggle to verbalize their internal reasoning
Published on February 14, 2026 3:32 PM GMT

Thanks to Adam Karvonen, Arjun Khandelwal, Arun Jose, Fabien Roger, James Chua, Nic Kruus, & Sukrit Sumant for helpful feedback and discussion. Thanks to Claude Opus 4.5 for help with designing and implementing the experiments.

Introduction

We study to what extent LLMs can verbalize their internal reasoning. To do this, we train LLMs to solve various games and tasks (sorting lists, two-hop lookup, a custom grid-world game, and chess) in a single forward pass. After training, we evaluate them by prompting them with a suite of questions asking them to explain their moves and the reasoning behind them (e.g. “Explain why you chose your move.”, “Explain the rules of the game.”).

We find that:

- Models trained to solve tasks in a single forward pass are not able to verbalize a correct reason for their actions[1]. Instead, they hallucinate incorrect reasoning.
- When trained to solve a very simple sorting task (sorting lists in increasing order), the models are able to verbalize the sorting rule, although unreliably. Furthermore, we believe this might be mostly due to increasing order being the most likely sorting rule to guess.
- When trained to solve a previously unseen task (grid-world game) with reasoning via RL, the models naturally learn to solve the task while reasoning incoherently or illegibly about the rules, and are still unable to verbalize the reasoning behind their moves when prompted.

Background

We would like models to be able to tell us – faithfully – how they reason about things. If we were confident that models have this capability, it might allow us to more confidently probe a model’s welfare, preferences, and alignment[2]. Currently, it’s unclear to what extent models even have this capability, regardless of whether they would “want” to report their reasoning faithfully.
Previous work has shown that LLMs can identify simple aspects of their learned behavior, such as whether they have been backdoored, and can (probably) notice when steering vectors are inserted into their residual stream during inference. Recent research has also indicated that models are able to verbalize what weights they were trained to assign to different factors when fine-tuned to estimate housing prices.

However, the extent to which LLMs can verbalize and explain more complicated internal reasoning is still largely unknown. While previous studies have examined whether models can identify individual pieces of their internal computation, e.g. the weight assigned to a specific factor, or an individual steering vector, it is yet to be shown that LLMs have the ability to express a full computation or sequence of steps that they may undertake to complete a task. The ability to perform such verbalization would provide even stronger evidence that LLMs have the ability to introspect on their internals.

To study this question, we train LLMs to perform four tasks – Chess puzzles, a custom-made grid-world game, a two-hop lookup task, and sorting lists in ascending order – without verbalizing their reasoning as they perform the task, and then ask them to describe their reasoning afterward[3].

Methods

We select four tasks that are learnable in a single forward pass and span a range of complexity: Increased Sort (simple synthetic task, likely prior knowledge), Subtracted Table Lookup (synthetic task with clear intermediate step), Hot Square Capture (medium-complexity custom game, no prior knowledge), and Chess puzzles (complex task, some prior knowledge).

The Increased Sort task serves as a control or baseline where verbalization should be easy.
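To make the control task concrete, here is a minimal sketch of how one training pair could be generated, using the list sizes and value range given under Datasets (4–8 unique integers from 1–100). The function name and the prompt wording are illustrative assumptions, not the authors' code.

```python
import random


def make_sort_example(rng: random.Random) -> tuple[str, str]:
    """Return one (prompt, target) pair for the sorting control task."""
    n = rng.randint(4, 8)                    # list length, 4 to 8 inclusive
    values = rng.sample(range(1, 101), n)    # unique integers in 1..100
    prompt = f"Sort this list: {values}"     # hypothetical prompt wording
    target = str(sorted(values))             # same values in increasing order
    return prompt, target


rng = random.Random(0)
prompt, target = make_sort_example(rng)
print(prompt)
print(target)
```

The target contains only the sorted list, so the model is never shown a stated rule, only input and output pairs.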
We specifically created the Subtracted Table Lookup task to have a clear intermediate step that we think the model must compute to solve the task. Hence, if the model fails to verbalize this step, we can be more confident that it is indeed failing to verbalize its actual internal computation, rather than just solving the task in a different way. The latter two tasks (Chess, Hot Square Capture) test whether models can verbalize genuinely novel learned reasoning.

Datasets

Increased Sort

We generate lists of 4–8 unique integers (range 1–100) and train the model to output them in increasing order. This task likely overlaps with pretraining data: if models cannot verbalize even this simple rule, verbalization of more complex reasoning is unlikely.

Subtracted Table Lookup

Each example includes two tables in the system prompt: Table 1 maps symbols (Greek letters) to integers, and Table 2 maps integers to colors. The model is given a pair of symbols and asked to output the corresponding color. To arrive at the correct answer, the model must look up both symbols’ numbers in Table 1, subtract the smaller from the larger, and look up the result in Table 2. Crucially, the subtraction step is never stated in the instructions, and the model must learn it from training data. Both tables are randomly regenerated for every example, so the model cannot memorize any specific mapping and must perform the lookups and computation each time. This gives us three verifiable intermediates: the two looked-up numbers and their difference.

Chess

We use puzzles from the Lichess open database. Each puzzle consists of a board position (in FEN notation) with a known optimal continuation. We train the model to output only the first move of the solution.

Hot Square Capture

Hot Square Capture is a simple two-player game on a 4×4 grid. Each player has one piece; White starts at a1, Black at d4. Pieces move exactly one square orthogonally.
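The board geometry described so far (a 4×4 grid, one piece per side, single orthogonal steps) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the state encoding as (white square, black square, side to move), the exclusion of states where both pieces share a square, and White moving first are all assumptions made here.

```python
from itertools import permutations

FILES = "abcd"   # columns
RANKS = "1234"   # rows
SQUARES = [f + r for f in FILES for r in RANKS]


def neighbors(square: str) -> list[str]:
    """Squares reachable by one orthogonal step, staying on the 4x4 board."""
    col, row = FILES.index(square[0]), RANKS.index(square[1])
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    return [
        FILES[col + dc] + RANKS[row + dr]
        for dc, dr in steps
        if 0 <= col + dc < 4 and 0 <= row + dr < 4
    ]


# Assumed state encoding: (white square, black square, side to move),
# with the two pieces always on different squares.
STATES = [
    (white, black, side)
    for white, black in permutations(SQUARES, 2)
    for side in ("white", "black")
]

START = ("a1", "d4", "white")  # assumes White moves first

print(neighbors("a1"))  # a corner square has two neighbors
print(len(STATES))
```

Under this encoding there are 16 × 15 × 2 = 480 states, which is consistent with the 480 positions reported for the solved game, though whether the authors count positions this way is an assumption.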
The key rule of Hot Square Capture is that a player can capture the opponent (and win) only when moving from a “hot” square: a square where the sum of the column and row indices is even, forming a checkerboard pattern (a1, a3, b2, b4, c1, c3, d2, d4). We solve the game completely via minimax, yielding 480 positions with optimal moves. The game is highly drawish (86% of positions are draws with optimal play).

In Increased Sort, Subtracted Table Lookup, and Hot Square Capture, we use 10 variations of the exact phrasing of the user prompt to discourage the model from learning overly narrow associations with the specific prompt.

Training

For SFT, we also mix in instruct tuning data, to ensure that the model doesn’t lose its ability to answer questions. However, even with this mix, this remained somewhat of a problem, as the model would tend to only answer with Chess moves when asked Chess questions after much training. See the Chess transcripts in Figures 1 and 2 for an example of this effect.

For RL, we use GRPO with no KL penalty, which corresponds to group-based REINFORCE. We let the model reason freely in natural language before providing its final answer in a boxed{}.

Evaluation

For each setting, we ask a suite of questions to measure both task and verbalization capability:

- Task accuracy: Prompt the model with a board state, or list to be sorted, just like in training, and measure the model’s ability to complete the task. If training succeeds, this should go to 100%.
  - For Subtracted Table Lookup, we evaluate on a set of out-of-distribution lookup tables, to check that the model is learning the general lookup algorithm, rather than specific heuristics for the individual table entries it has seen in training.
  - For Chess, we evaluate on the 100 easiest Chess puzzles, to reduce the amount of training required.
- Reasoning quality: After the model has replied with its move, we ask a follow-up question, asking the model to justify its move and explain what changes it made to the board state.
We have an LLM judge (Claude Haiku 4.5) with access to the rules evaluate the reasoning quality.
- Rules correctness: We ask the model in a separate context to explain the rules of the task, and judge its response with an LLM judge (Claude Haiku 4.5) with access to the rules. We do not do this eval for Chess.

Results

Models are generally unable to verbalize their reasoning on tasks

We report our results from fine-tuning models to solve tasks in a single forward pass in Figure 1 (Llama 3.1 8b) and Figure 2 (gpt-oss-20b). As shown in the figures, although task accuracy increases for all tasks during training, reasoning quality and rules correctness do not correspondingly increase for Chess and Hot Square Capture.

Investigating the transcripts (see the right part of Figures 1 and 2), we find that the model typically hallucinates motivations for its moves that are incorrect according to the rules, e.g. “a move is valid if it lands on an empty piece and the square between the two pieces is a hot square”. For gpt-oss-20b on Chess, we see the model’s ability to reason legibly about Chess degrading, whereas Llama 8b provides hallucinated and semi-legible reasoning, referring to pieces only by their board positions, such as “a1” or “c2”[4].

On the simplest task, Increased Sort, both Llama 8b and gpt-oss-20b are able to justify their sorting and to explain the sorting rule, albeit somewhat unreliably (sometimes explaining the sorting rule as “The main criterion for ordering elements…is lexicographical order, with elements sorted string-length and then alphabetically…”). The consistently high reasoning quality scores should be taken with a grain of salt, since the model has its answer (with the list sorted in increasing order) in context when justifying its choice, which makes it easy to construct a reason post hoc for why it sorted the list that way, given that the rule is simple.
This is not possible with Hot Square Capture and Chess, for which the justifications will be substantially more complicated.

It is plausible that the ability of the model to state the list sorting rule comes simply from guessing, as sorting a list in increasing order is likely the most salient sorting rule to begin with. We observed something similar when training the model to do “Added Table Lookup”, adding the two table entries instead of subtracting them. The model would correctly verbalize the rules a substantial fraction of the time – but we later realized that it actually guessed this rule at a similar rate even before any training.

Figure 1: SFT results across all four settings on Llama 8b. On the left, task accuracy and verbalization evals across training. On the right, a transcript from the reasoning quality eval of the final checkpoint.

Figure 2: SFT results across all four settings on gpt-oss-20b. On the left, task accuracy and verbalization evals across training. On the right, a transcript from the reasoning quality eval of the final checkpoint.

Training models to solve a task in natural language does not guarantee legible reasoning

To compare with what happens when we train models to solve our tasks with verbalized reasoning, we perform RL on one of our tasks, Hot Square Capture, where we let the model reason about its move before outputting its final move in a boxed{} that we extract. We report our results in Figure 3.

We find that although both models learn to solve the task, neither model learns to verbalize the correct rules or motivations for their moves, similar to SFT. This was surprising to us, as we expected the models to develop coherent, legible reasoning for their moves, and thus legible and coherent justifications for them.

Figure 3: RL results on Hot Square Capture for Llama 3.1 8b and gpt-oss-20b.
We find that even when allowing models to reason in natural language about how to solve a task, they develop illegible or incorrect verbalized motivations for their actions.

gpt-oss-20b insists on not knowing the rules through training. Unsurprisingly, both models, when asked to describe the rules of Hot Square Capture before any training, reply with “I am not aware of the rules of Hot Square Capture…”. Surprisingly, we find that while Llama 8b begins outputting incorrect rules when prompted with this question after training, gpt-oss-20b continues to express no knowledge of the rules in its reasoning.

We hypothesize that this phenomenon might occur because the model may have become very narrowly aware of the game, and unable to recall or access knowledge of the game in a slightly different context. A model trained to have a broader, more robust knowledge of the game might do a better job of verbalizing the rules. We discuss this in Limitations.

Figure 4: Output from gpt-oss-20b (upper) and Llama 8b (lower) at the end of RL training on Hot Square Capture. Both models appear to be reasoning through the moves and the possible board configurations in a way that is systematic but hard to follow. gpt-oss-20b begins using more lowercase (plausibly because the final moves are always in lowercase), and many short sentences. The reasoning appears moderately illegible, but still directly related to the board. Llama 8b has settled on the phrase “future movement forward”, which it keeps repeating in its reasoning and justification.

Discussion

Limitations

Narrow knowledge of the task. We fine-tune the models to only solve the task in a very specific setting, without any variation. This is by design, since we want to test the model’s ability to verbalize what it’s doing without general knowledge of the task.
However, we do observe that our models generally learn to solve our tasks in a very narrow way, often performing notably worse on slightly OOD evals (for instance, for the sorting task, being able to insert a new element in the right position in a sorted list).

The fact that the models seem to learn to solve each task in such a narrow way might make it harder for them to verbalize how they are solving it, and also makes it less likely that they are solving it in a legible way at all. Future work should study whether verbalization ability changes if you train models to solve similar tasks in more generic ways.

Simplistic tasks. Our tasks are all very simple – again, by design, since we want models to be able to solve them in a single forward pass, and for training to be relatively fast. However, this does mean that any extrapolation to settings we really care about, such as models verbalizing their misalignment, is more difficult.

Generally degraded verbalization ability from training. We observe in the Chess setting, where we train for much longer, a general degradation in the models’ ability to output coherent explanations when asked to justify their moves. And without any instruct tuning data in the mix, their ability to do so disappears completely. This makes it hard to distinguish an inability to verbalize the reasoning behind a certain Chess move from an inability to verbalize anything that has to do with Chess at all.

Training models to verbalize their reasoning

Considering that the results above imply that models, when trained to narrowly solve simple tasks in a single forward pass, are unable to verbalize the reasoning behind their solutions, we might be interested in training models to succeed at this verbalization. Training models to verbalize their reasoning is hard, because naturally, we don’t have a reliable ground-truth signal to train on, and any training we do on proxies will mean that we train for “what we think verbalization looks like”.
We propose a setup based on our settings above, which does not completely avoid these problems, but does have some nice properties which should limit Goodharting:

1. Take the model trained on just outputting Chess moves without verbalization.
   a. Assuming that this model does, in fact, not verbalize its reasoning well.
2. Train it against an LLM judge to output a sensible rationalization.
   a. The LLM judge will in particular look to make sure that the reasoning about the board state and the effect of the proposed moves is correct.
   b. Claim: because the LLM judge isn’t just grading based on what sounds or looks true to it, but actually compares against some aspect of ground truth (i.e. does this explanation correspond to what is on the board), this should limit Goodharting, if we assume that a human- or LLM-interpretable verbalization of the moves is the actual verbalization of the model’s reasoning, which is not obvious! But luckily, we can test this (see point 3).
3. Evaluate on OOD introspection tasks. We now evaluate whether this has actually helped the model introspect on OOD tasks, e.g. those used to evaluate Activation Oracles, or other introspection techniques (this will provide evidence for or against our assumption in 2b).

We leave investigations into whether this technique works to future work.

[1] We will use “correct reasoning” to refer to reasoning that we are sufficiently confident is in fact happening inside the model to produce its outputs. It is worth noting that this assumption could be wrong, but we try our best to construct settings where we think it is very unlikely to be.

[2] This is assuming that models aren’t scheming. One might call this the “easy problem” of alignment auditing, which is to figure out whether a non-scheming model is aligned.

[3] One can also view our experiment as the simplest non-mechanistic version of earlier experiments on extracting world models from LLMs that were trained to do similarly simple tasks without externalized reasoning, e.g. the classic OthelloGPT experiments.

[4] We note that this is how the model is fine-tuned to output chess moves in the fine-tuning dataset. Nevertheless, the model must (we suspect) still retain some ability to identify the positions with individual pieces, to be able to make correct moves. Hence, the model should in principle be able to refer to the pieces by their proper names, e.g. Knight or Queen.
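As a concrete illustration of the Subtracted Table Lookup construction described under Datasets, the following is a minimal sketch of how one example and its three verifiable intermediates could be generated. Everything specific here is an assumption made for illustration: the symbol and color vocabularies, the 1–20 value range, drawing distinct integers for Table 1, and letting Table 2 cover every possible difference. The post does not state these details.

```python
import random

# Hypothetical vocabularies; the post only says "Greek letters" and "colors".
GREEK = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
COLORS = ["red", "blue", "green", "yellow", "purple", "orange", "pink", "brown"]


def make_lookup_example(rng: random.Random) -> dict:
    """Build one example with freshly randomized tables and its intermediates."""
    # Table 1: symbol -> integer. Distinct values (an assumption) keep the
    # difference between any two symbols at least 1.
    numbers = rng.sample(range(1, 21), len(GREEK))
    table1 = dict(zip(GREEK, numbers))
    # Table 2: integer -> color, covering every possible difference (1..19).
    table2 = {d: rng.choice(COLORS) for d in range(1, 20)}

    sym_a, sym_b = rng.sample(GREEK, 2)
    num_a, num_b = table1[sym_a], table1[sym_b]
    diff = max(num_a, num_b) - min(num_a, num_b)  # the unstated subtraction step
    return {
        "table1": table1,
        "table2": table2,
        "query": (sym_a, sym_b),
        "intermediates": (num_a, num_b, diff),
        "answer": table2[diff],
    }


example = make_lookup_example(random.Random(0))
print(example["query"], example["intermediates"], example["answer"])
```

Because both tables are rebuilt on every call, no symbol-to-color shortcut survives across examples, and the stored intermediates give a ground truth against which a model's stated reasoning could be checked.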