
Axes of Planning in LLMs + Partial Lit Review

Epistemic Status: Written over the course of a couple of days at Inkhaven. Some of the info is a bit old, so some newer papers are excluded.

TL;DR: People talk about "interpreting planning and goals" in models all the time, but people have different understandings of what exactly "planning" means. I try to decompose this into a few axes, and categorize some existing work along them.

We give a model a prompt and it produces an output. How does that work?

Do language models plan?

When people look at language models and want to use interpretability and evaluations to understand behavior, one natural question to ask is whether the model is planning. However, "planning" is a relatively vague concept that points at a few different things. I try to walk through a bunch of examples that seem somehow related to aspects of "planning", then try to divide planning into a few different axes.

Some related examples

Here are a few different things that seem related to planning:

- The model is writing something, and brings to its attention some fact that is not useful for the immediate next-token prediction, but that may or may not be useful later.
- The model realizes it is finished with the current point, and gives an output that indicates it wants to move on (e.g. the paragraph has ended, so it outputs a newline).
- The model has read that it is supposed to move onto the next line, and now needs to start writing about the next thing.
- The model has a vague outline of what the whole output is going to say, but hasn't written it down.
- The model spends some time reasoning about the best policy to use in a game, but within the game doesn't deviate from a simple [observation] → [action] policy.
- The model has turned its vague outline into a written outline of what it is going to say, and is now trying to follow it.
- One robot wants to go to the north pole, and each day goes north one mile as a result of this. Another robot doesn't care about going to the north pole, but also just goes north one mile each day.[1]
- The model does not have an outline of what it is going to say, and is just writing section-by-section based on what feels like the next heading or section to write.
- The model knows that the outputs need to eventually reach some end state X, and while outputting things considers things related to X more prominently.
- The model has some end-state X it needs to reach, and has deduced that it needs to do Y to reach X, so it writes about things that get it to Y without writing this down explicitly.
- The model has some constraint on how some part of the output needs to look, and thus conditions earlier outputs with this in mind.
- The model has some explicit branching state: it knows that in order to reach output Y or Z, it needs to either confirm or deny claim A; once A is settled, reaching Y or Z would be easy, so it deals with A first.

Some of these seem more related to "planning" than others. Can we try to break down the different aspects?

Breaking it down a bit

What does the LLM planning process look like? If one is thinking about the question "what kind of planning is the model doing?", then the word planning can mean a lot of different things.

Time Horizon: How far ahead is the model representing what outputs might look like? The next token? The next phrase? The next sentence or paragraph? The next section? The whole output?

Vague vs Specific: How fine-grained is the model's representation? Does it have a vague theme? Does it have a topic? Does it have an outline on the level of sections?
Does it have specific sub-goals that it wants to achieve on a high level? Does it have a detailed plan for how it's going to achieve those sub-goals?

Option Space: Does the model have an idea of what the possible outcomes are? Are there many specific options being considered explicitly? Or one broad, vague option? Or one narrow option, perhaps with some idea of what might cause the model to change course?

Forward Dependency/Constraints: To what extent is the forward-looking computation shallow or nested? Were there a lot of constraints that meant it needed to prune the option space a lot? Is there much sequential dependence between the constraints? Or does it have some graph of steps that need to be completed in order before the end goal?

This bucket also has some questions on whether the planning is more "implicit" or "explicit". That is, I would say there are qualitatively different ways one can "plan":

Implicit Planning: Saving useful information now, or doing some computation on the context, because it seems like the kind of thing that might be used later. In my mind this means things like "Oh, I have seen this topic before, here are some templates and structures I could follow."

Explicit Planning: Realizing that there are constraints about the future, and changing behavior now to condition on those later constraints. This could mean, for example, reasoning back from an end-state X to what needs to happen now in order to reach it. More explicit planning seems to be higher on the axis of forward dependency and following constraints.

Externalized vs Internalized: Sometimes people investigate a written outline of a plan, and sometimes people investigate what is happening in activation space. These are pretty related, but worth distinguishing.

Consistency: Does the model decide once what to do? Or does the model constantly re-evaluate what it wants to do at each step? The "most planning-like" behavior seems to involve a plan that stays mostly consistent over a longer horizon, while still being able to readjust.

There may also be some other aspects I am missing, or ways in which I am confused about how some of these fit together. However, given this breakdown, I next try to go over some papers with these axes in mind.

Lit Review of Planning in LLMs

Above, I looked at some ways you can break down the idea of planning. These were:

- Time horizon: how far ahead does the model know things?
- Vague vs Specific: how specifically vs how vaguely does the model know things?
- Option Space: how much is the model considering different possibilities?
- Dependency Steps: to what extent are there consecutive constraints?
- Implicit vs Explicit: is the model choosing things more by inferring from previous observations, or by reasoning back from a future end-goal?
- Externalized vs Internalized: is the plan kept hidden in its activations, or has the model written it down externally and is now following it?
- Consistency: does the model have the whole plan decided once, or is it being re-evaluated step-by-step?

I look at some clusters in the literature that try to investigate some of these questions. I focus here only on internalized planning, not so much on externalized plans.
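To keep track of how papers relate to these axes, here is a minimal sketch in Python of the axes as a small tagging structure. This is my own illustrative framing, not something from any of the papers below, and the example tags for the poems result (reviewed later) are my own rough judgments:

```python
# Minimal, illustrative sketch of the planning axes as a data structure.
# The axis names and questions mirror the list above; the example tags are
# my own rough judgments, not claims made by the papers themselves.

AXES = {
    "time_horizon": "How far ahead does the model represent things?",
    "vague_vs_specific": "How fine-grained is the representation?",
    "option_space": "How many possibilities is the model considering?",
    "dependency_steps": "How nested/sequential are the constraints?",
    "implicit_vs_explicit": "Inferred from context, or reasoned back from a goal?",
    "externalized_vs_internalized": "Written down, or kept in activations?",
    "consistency": "Decided once, or re-evaluated at every step?",
}

def tag_paper(name: str, **tags: str) -> dict:
    """Attach a rough rating along each axis to a paper; unknown axes raise."""
    unknown = set(tags) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {unknown}")
    return {"paper": name, **{axis: tags.get(axis, "not tested") for axis in AXES}}

if __name__ == "__main__":
    example = tag_paper(
        "Planning in Poems (Lindsey et al. 2025)",
        time_horizon="one line",
        implicit_vs_explicit="explicit",
        consistency="commits once the line starts",
    )
    for key, value in example.items():
        print(f"{key:>30}: {value}")
```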
Examples in literature

I will briefly go over papers I've managed to read and understand, categorized this way. There are some follow-up papers and papers I still have to read; I may or may not update this post with those in the future.

- Theoretical: pre-caching vs breadcrumbs, predicting vs acting
- Looking at timescale: Future Lens, ParaScopes
- Testing explicitness: poems, hidden goals
- Forward dependency: chess, Blocksworld, Sokoban
- Methodological: Activation Addition, Patchscopes, LatentQA, Activation Oracles

I list the papers I didn't get to, but which seem relevant, in the footnotes.[2]

Papers that test Explicitness/Consistency

One good paper is Anthropic's biology of LLMs paper from 2025[3]. They test a couple of things I will describe here.

Planning in Poems

One main thing is that they specifically test an aspect of dependency/explicitness in planning. They find that, given the activations at the newline after the last word of a line of a poem, the model has some explicit idea of the word the next line must end in, such that it can make the poem rhyme.

They use a variant of SAEs to identify which words the model is thinking of at that point, do causal interventions to remove a word, and see that the model then uses one of the other words it was considering instead.

This seems like pretty good evidence of quite explicit planning with some dependency steps, on the time horizon of one line. The model also seems to commit to, and be consistent about, the line and the word once it has started.
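The paper itself uses its own SAE-variant features and interventions on Claude, which I can't reproduce here. As a much cruder sketch of the same style of causal intervention, assuming an open model (GPT-2), an arbitrary layer, and a guessed candidate end-word, one could project the candidate word's unembedding direction out of the residual stream at the newline token and see whether the continuation changes:

```python
# Crude, illustrative sketch loosely in the spirit of the poems experiment.
# Model, layer, prompt, and the candidate end-word (" rabbit") are arbitrary
# illustrative choices; this is not the method used in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

prompt = "He saw a carrot and had to grab it,\n"
LAYER = 8  # arbitrary mid layer
word_id = tok.encode(" rabbit")[0]                    # hypothesised planned end-word
direction = model.lm_head.weight[word_id].detach()    # unembedding vector as a stand-in direction
direction = direction / direction.norm()

def ablate_hook(module, inputs, output):
    hidden = output[0]
    # Only edit the prefill pass (full prompt), at the final (newline) position.
    if hidden.shape[1] > 1:
        h = hidden[:, -1, :]
        hidden[:, -1, :] = h - (h @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:]

ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    baseline = model.generate(ids, max_new_tokens=12, do_sample=False)

handle = model.transformer.h[LAYER].register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated = model.generate(ids, max_new_tokens=12, do_sample=False)
handle.remove()

print("baseline:", tok.decode(baseline[0][ids.shape[1]:]))
print("ablated: ", tok.decode(ablated[0][ids.shape[1]:]))
```

If the model really was representing that end-word at the newline, the ablated continuation should end the line differently, which is roughly the shape of the paper's intervention (their actual result is on a much more capable model with much better-targeted features).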
Hidden Goals

In auditing for hidden goals[4], they find that training on descriptions of different goals for the model does cause the model to then behave differently, conditional on those goals. This mostly shows that models can explicitly condition their outputs on a goal, even when trained not to externalise it. The behavior is also consistent on the time-scale of a whole question, though it mostly tests things that seem relatively shallow in terms of dependency steps.

There are a few other papers which also investigate hidden goals, but I consider these a bit out-of-scope for this review.

More Theoretical Papers

Pre-caching vs breadcrumbs

This 2024 paper decomposes information processing that is useful for later predictions into two types. Either the information is immediately useful for the current prediction and happens to also be useful for predictions down the line (breadcrumbs), or the information is not immediately useful and instead takes up thinking space, hurting predictions now but on net helping predictions later down the line (pre-caching).

I would say this work is mostly trying to operationalize whether the model's apparent planning is more implicit happenstance, or whether it is more explicit. The description of an explicit tradeoff is useful. They test small language models and do find a timescale of at least a few tokens, but only test a shallow and non-specific forward dependency. They also test specific information in a toy task, and do find some more specific mathematical functions being pre-cached there.

Acting vs predicting tradeoffs

In this 2024 paper, they discuss the somewhat well-known phenomenon of "mode collapse" when doing RL on models. Training on goals makes it so that, in order to act, the model needs some idea that it will follow a consistent policy in the future. There is thus a tradeoff: a smaller option space in exchange for a longer time horizon that is more specific and less fuzzy, though this is more relevant to implicit rather than explicit planning. They don't test time-scale at all, but they do describe how this means that post-RL LLMs probably have, at least implicitly, more consistency in planning and more specific ideas of what the future looks like.

Papers that test Time-Scale/Consistency

Future Lens

In this older, more foundational 2023 paper[5], they tried different ways of probing the hidden states to predict tokens several steps ahead. They found they could sometimes do this a few tokens ahead at an accuracy above bigram statistics. I would say it tests a time horizon of a few tokens, and shows that there are sometimes very specific plans, but it doesn't say much about the other aspects.

ParaScopes

My work from 2024/2025 mostly looks again at time horizon, showing moderate evidence for paragraph-scale planning, but not really for longer document-level planning. The model seemed to be planning around ~5-10 tokens ahead, when compared to how much it could simply infer, with a medium amount of explicitness. There is some degree to which the planning seems non-consistent: we tested how the model's plans change as it goes from one paragraph to another, and it seems to have much more information at the newline before the new paragraph, much more so than even one paragraph earlier. There is not much testing of dependency steps or implicit vs explicit planning here; I would like to do more causal experiments on this some time.

Methodological Papers

Most of these papers are relevant, but aren't explicitly testing planning. They are more so tools that would potentially be useful for this type of testing.

Activation Addition (Activation Engineering)

In this work from 2023, they find that you can take activations from a model run on one context, and steer the model to do different things in another context by linearly adding in those activations. This mostly just shows that the option space is broad, and it provides tools one could use to test for planning.
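To make the activation-addition idea concrete, here is a minimal sketch following the general ActAdd recipe: take the activation difference between two contrasting prompts at one layer, then add it into the residual stream while generating from an unrelated prompt. The model (GPT-2), layer, contrast prompts, and coefficient are arbitrary illustrative choices of mine rather than values from the paper:

```python
# Minimal activation-addition sketch: steer generation by adding a
# contrast-derived activation vector into the residual stream at one layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER, COEFF = 6, 4.0  # arbitrary illustrative choices

def layer_acts(text: str) -> torch.Tensor:
    """Residual-stream activations at LAYER for a prompt (last token position)."""
    captured = {}
    def grab(module, inputs, output):
        captured["h"] = output[0][:, -1, :].detach()
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["h"]

# Steering vector: difference between two contrasting contexts.
steer = COEFF * (layer_acts(" weddings") - layer_acts(" "))

def add_steer(module, inputs, output):
    hidden = output[0] + steer  # broadcast over all positions
    return (hidden,) + output[1:]

prompt = "I went to the park and"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    plain = model.generate(ids, max_new_tokens=20, do_sample=False)

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
with torch.no_grad():
    steered = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()

print("plain:  ", tok.decode(plain[0]))
print("steered:", tok.decode(steered[0]))
```

The same kind of hook can be reused to read out or perturb whatever the model is representing about upcoming text, which is why I group this with the methodological tools.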
Patchscopes

In this 2024 work[6], they describe some principles for using models to decode their own activations, and mostly focus on probing what the model is thinking about for the next token. This is something like a generalization of ActAdd and Future Lens.

Meta models for decoding information

In LatentQA[7], they use an additional model to decode what the model is thinking about on the horizon of the next token to the next phrase. They try to find the more implicit type of planning, like "what persona is this", rather than trying to find plans directly.

In Activation Oracles[8], they also use a fine-tuned model to try to decode what the activations mean, by answering natural-language questions. This seems potentially useful for understanding planning in the future too, though I have some concerns about such probes learning too much here as well.

Search Problems

There are a few papers looking specifically at decision-tree games, to see forward dependency and implicit vs explicit planning. There is some work down this line, labelled "Searching for Search".

Sokoban

One paper looks at an RL agent playing the game Sokoban[9], and finds that performance increases when one lets the model have some free time at the start to make random moves. This somewhat shows a level of explicitness and forward-dependency in the planning process at the start, more so than simply implicit planning.

Chess Models

One paper looks at chess[10], and finds some explicit look-ahead, in that specific later moves do seem to be important to the planning process of that chess model. In a follow-up paper[11], someone else continues to look at this, and tries to look more moves ahead (I haven't had time to fully read this paper though). These settings are good for looking at things that are more explicit and have many dependency steps.

Blocksworld

One paper looks at LLMs in games[12], and mostly finds that models trained on a specific game can sometimes make good moves, but I'm not sure how much this generalizes to normal language models. (In general, I think that if it's a toy task, it may as well be done with a toy model.)

What work is still missing?

I think there is still a lot missing in actually being able to decode how language models might or might not be doing different kinds of planning. I do think the axes I use to analyze these papers help me see more clearly how the different papers are asking different questions. I think activation oracles could be one method for trying to probe planning, and more work similar to what Anthropic is doing could work too. But I don't think there is that much understanding of what models are doing overall, nor a specific overarching view of how LLMs are planning (though there are a few papers I still need to catch up on).

This work was day 15/30 of daily posting at Inkhaven. There may be some information I missed, and some things may not be as polished as I would like. Some of the papers I had only read a while ago.

[1] https://www.lesswrong.com/posts/j9qG76qAKygPbGqZy/ideation-and-trajectory-modelling-in-language-models
[2] Papers I didn't get to read but which seem relevant too: Detecting and Characterizing Planning in Language Models; Internal Planning in Language Models: Characterizing Horizon and Branch Awareness; Emergent Response Planning in LLMs; Transformers Can Navigate Mazes With Multi-Step Prediction; Latent Planning Emerges with Scale; Interpreting Emergent Planning in Model-Free RL; Auditing Language Models for Hidden Objectives; Thinking Models.
[3] On the Biology of a Large Language Model (Lindsey et al., 2025)
[4] Auditing Language Models for Hidden Objectives (Marks et al., 2024)
[5] Future Lens: Anticipating Subsequent Tokens from a Single Hidden State (Pal et al., 2023)
[6] Patchscopes: A Unifying Framework for Inspecting Hidden Representations (Ghandeharioun et al., 2024)
[7] LatentQA: Teaching LLMs to Decode Activations Into Natural Language (Pan et al., 2024)
[8] Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (Karvonen, 2025)
[9] Planning in a Recurrent Neural Network that Plays Sokoban (Taufeeque et al., 2024)
[10] Evidence of Learned Look-ahead in a Chess-Playing Neural Network (Jenner et al., 2024)
[11] Understanding the Learned Look-ahead Behavior of Chess Neural Networks (Cruz, 2025)
[12] Unlocking the Future: Exploring Look-ahead Planning Mechanistic Interpretability in Large Language Models (Men et al., 2024)