
Agents Can Get Stuck in Self-distrusting Equilibria

Or: Identities as Schelling Fences for Embedded Agents

This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. He contributed significantly to the ideas discussed within.

Introduction

This post questions the sanctity of the “agent” and discusses how Temporal Instances (TIs) of an agent can enter conflict due to distrust. These dynamics are describable mathematically as an intrapersonal cooperative game. I define a time-version of Nash equilibria and show an example of a self-punishing pattern between TIs that is nevertheless stable. This leads us to ask what conditions allow disparate parts of an agent to cooperate harmoniously. I conjecture that agents showing a degree of consistency in their actions over time can be seen as adhering to an identity that replaces Common Knowledge of Rationality (CKR) between the game’s players. In subscribing to a common identity, TIs declare trust in each other akin to that which an updateless[1] agent would embody.

I next deliberate on the shape that a formal statement and proof of this conjecture is likely to take. This will involve a translation of universal type spaces to intrapersonal games for a complete treatment of CKR. I also cover what other notions of equilibria and solution concepts would be helpful to adapt into a framework of self-coordination games.

At the end, I give a brief discussion of the relevance of my work to AI, including my plans for increasing that relevance.

The incoherent “self”

Much hay has been made over whether individuals have consistent preferences over worlds, and what properties their induced utility function might have. A primary motivation for this line of work is that individuals who fail to order worlds coherently could be money-pumped for arbitrarily high value. However, money pumping requires a sequence of trades, which makes it a fundamentally temporal phenomenon. It is thus equally important for unexploitable agents to have consistent preferences across time. For the purpose of this post, we will assume that fixed Temporal Instances (TIs) of an agent have utility functions with all the usual nice properties.

Theoretical economists have long suspected that agents’ utility functions vary with time[2], which led them to ask how TIs can cooperate or conflict with each other. A dynamically consistent agent is one whose future selves endorse her plans and carry them out. Conversely, a dynamically inconsistent agent experiences preference reversals and changes her plans midway. Strotz (1955) found that dynamically consistent agents are exactly those that discount their future selves’ utilities exponentially. Unfortunately, experimental evidence has repeatedly shown that animals discount hyperbolically[3], which entails changes in preferences over policies throughout the agent’s life. Animals (especially humans) can nevertheless form and execute plans, indicating some tendency towards coherence over time. These observations motivate questions about how real, existing agents actually resolve into relatively harmonious decision procedures, despite being a priori in conflict with themselves.
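To make the contrast concrete, here is a minimal numerical sketch of the resulting preference reversal. The reward sizes, delays, and discount parameters are purely illustrative: evaluated far in advance, both discounters prefer the larger-later reward, but close to the moment of choice the hyperbolic discounter flips to the smaller-sooner one while the exponential discounter does not.

```python
# Minimal sketch (illustrative numbers): a preference reversal under hyperbolic
# discounting that does not occur under exponential discounting.

def exponential(delay, gamma=0.9):
    return gamma ** delay

def hyperbolic(delay, k=1.0):
    return 1.0 / (1.0 + k * delay)

small, t_small = 4.0, 10    # smaller reward, arriving at t = 10
large, t_large = 9.0, 13    # larger reward, arriving at t = 13

for now in (0, 9):          # evaluate far in advance, then just before the small reward
    for name, discount in (("exponential", exponential), ("hyperbolic", hyperbolic)):
        v_small = small * discount(t_small - now)
        v_large = large * discount(t_large - now)
        winner = "larger-later" if v_large > v_small else "smaller-sooner"
        print(f"t={now:2d}  {name:11s}  prefers {winner}")

# Exponential: larger-later wins from both vantage points (no reversal).
# Hyperbolic: larger-later wins at t=0 but smaller-sooner wins at t=9 (a reversal).
```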
What does coherence look like?

Even though agents are rarely dynamically consistent by default, both humans and theorists have developed strategies and devices to enhance internal coherence. The first and conceptually simplest of these is coercion. Suppose I want to get up at 6am tomorrow to start an early day, but I know my future self’s revealed preference will be to sleep in. I can enforce my will by setting an alarm to wake myself up in the morning. If I suspect I may just ignore my alarm or turn it off and go back to sleep, I can escalate and set alarms on multiple devices or make them hard to turn off.

Shaping future selves’ incentive landscapes to have them comply with your plan was suggested as early as 1955 by Strotz, and it is indeed a popular strategy for humans to align themselves across time. This type of coercion generally works by screwing over your future self, taxing their utility such that your preferred option becomes theirs. This is liable to generate some degree of internal discord and resentment, which makes it potentially best modelled by conflict – not cooperation – dynamics.

Other, less coercive mechanisms have been proposed. Strotz, for instance, suggested that a TI should select only a plan that will be followed up on by its successors; mathematically, this comes down to selecting a plan as if you were dynamically consistent, even if you aren’t. This approach sacrifices the optimal plan from your current self’s perspective in favour of one you know will be respected. The term “commitment issues” has become common in referring to TIs not adopting their preferred plan for fear of a preference reversal. Some have accepted commitment issues as a fundamental limitation of (utility-maximising) agents. Others, however, have drawn inspiration from human psychology to inform approaches to explaining and prescribing how an agent might overcome this problem.

Suppose Charlie identifies as a hard-working person. Ideally, she would take an early night today so “tomorrow’s Charlie” can wake up early and get a productive day of work in. However, she stayed up last night (possibly working), so she would feel guilty retiring after such a lax day of work. She therefore stays up, resulting in a late night that sabotages “tomorrow Charlie’s” work plans and perpetuates the cycle. Charlie’s behavioural patterns are consistent with her goal of getting her work done (or of embodying the virtue of being hard-working). She is a functional agent whose temporal instances are somewhat aligned with, and enable, the larger agent’s goal. However, she is stuck in a suboptimal, self-sustaining cycle where none of the TIs are willing to take the hit and rest so that she can fulfill her goal even more completely. Her self-coordination pattern is stuck in an inadequate equilibrium.

Take Dean as another example. He has the same identity as Charlie and fulfills it in the same inadequate way. However, his reason for taking late nights is that he doesn’t believe his future selves will actually work hard, even if he takes an early night on their behalf. Whereas Charlie’s dysfunction comes from none of the TIs being willing to rest for one day, Dean’s comes down to distrust of future temporal instances.

Charlie and Dean both maintain their identity as hard-working by keeping up a streak of “hard-working” days. These kinds of dynamics were seemingly popularised by George Ainslie’s picoeconomics agenda. He defines “strength of will” as the tendency of an intention to be preserved over time. Ainslie moreover suggests that willpower can be thought of as an intertemporal bargain; these bargains are enforced by recursive self-prediction: an agent that has credibly shown herself to resist temptation can confidently predict she can commit to doing so again. Conversely, she can predict she won’t resist if she has a history of giving in. This prompts her to hedge against her own lack of willpower and forgo commitment.
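This recursive-self-prediction story can be caricatured in a few lines of code. The threshold, update rule, and success probability below are illustrative assumptions rather than Ainslie’s actual model; the point is only that a track record of resisting sustains further attempts, while a track record of giving in leads the agent to stop trying altogether.

```python
# Caricature of willpower as recursive self-prediction (thresholds and
# probabilities are illustrative assumptions, not Ainslie's model).
import random

def run(initial_record, days=30, threshold=0.5, base_success=0.8, seed=0):
    rng = random.Random(seed)
    history = list(initial_record)              # past record: True = resisted, False = gave in
    resisted_days = 0
    for _ in range(days):
        predicted = sum(history) / len(history)     # self-prediction from the track record
        if predicted >= threshold:
            resisted = rng.random() < base_success  # she tries, and usually succeeds
        else:
            resisted = False                        # she hedges and doesn't even try
        history.append(resisted)
        resisted_days += resisted
    return resisted_days

print("days resisted, given a streak of resisting:", run([True] * 5))
print("days resisted, given a streak of giving in:", run([False] * 5))
```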
More recent research, informed by Ainslie’s contributions, has modelled the building and maintenance of willpower in toy examples of intrapersonal games. For example, Bénabou and Tirole (2004) explore the conditions under which an agent can learn to trust its own ability to make commitments, and de Font-Reaulx (2025) models how a “night” version of an agent is incentivised to follow up on commitments made by a “day” version in order to build a reputation in an iterated game[4].

I like that these approaches don’t take dynamic consistency for granted. Scrutinising intertemporal bargaining dynamics is a good step towards figuring out how internally stable agreements can be converged to. Nevertheless, these frameworks undersell the extent to which “identity”, as used in the earlier examples, can help maintain those stable arrangements.

People sometimes reason explicitly about their future selves as one would about other players in a game, but this is rare, possibly due to the costly inference that would be involved. Moreover, some commitment can be seen as coordination with counterfactual selves, which is even more rarely observed[5]. In practice, an action is selected due to its relationship to the agent’s self-model. A recovering alcoholic, for instance, doesn’t experientially assign a high value to retaining her sober streak “because” breaking the precedent of sobriety will affect her self-recursive predictions. Rather, sobriety has now become part of her identity, such that drinking would lead to an existential crisis. This suggests to me that there should be a simpler “identity” model that describes an updateless or dynamically consistent agent[6] through a mechanism that is more conceptually elegant and likely also wins in computational efficiency. By identity, here, I mean something very similar to an agent’s self-model. All the complex machinations of a rational TI are replaced by her intuiting the existence of a geometrically salient self-model that she is compelled to respect and preserve, and that she believes other TIs will also be likely to respect[7]. The maintenance of a self-model (such as that of being a recovering alcoholic) thus takes the role of a Schelling fence that TIs can default to in the absence of communication.

The use of the term “Schelling fence” might seem strange here, since Schelling points and Schelling fences are usually invoked in contexts where there is no communication. This doesn’t intuitively map onto games between TIs, since these can definitely communicate with each other through memory and the mnemonics that enhance it. However, this post starts by questioning the concept of a coherent “agent”, highlighting internal conflict. In the absence of trust between TIs, the concept of communication becomes hard to define or rely on. Hence, the key question is what trust between TIs, or trust in a shared self-model, actually looks like. Once trust is granted, some amount of dynamic consistency is likely, since TIs would be seen as caring about their common good.

A mathematical framework for (lack of) self-trust

To address the question of trust between TIs, I made a toy framework to explore interactions between TIs when we don’t assume that they are already coordinated via a shared notion of self.

Let $\Sigma$ denote a finite alphabet, representing the choices an agent has in identical steps of an iterated game. We let $u : \Sigma \to \mathbb{R}$ be a utility function ordering these choices. We refer to sequences $g \in \Sigma^{\mathbb{N}}$ as games, and to finite sequences $h \in \Sigma^{*}$ as histories. The reward of a game is defined by $R(g) = \sum_{t} \gamma^{t} u(g_t)$, where $\gamma \in (0,1)$ is a discount factor. Even though hyperbolic discounting is necessary for inducing preference reversals in an agent, we won’t actually need it to describe agents that don’t trust “themselves”.

An agent $A$ is a family of functions $(A_t)_{t \in \mathbb{N}}$ where, for each $t$, the associated function has the type signature $A_t : \Sigma^{t} \to \Sigma$. We call these functions the temporal instances (TIs) of $A$. The game $g^A$ associated with $A$ is the sequence defined recursively by

$$g^A_t = A_t(g^A_0, \dots, g^A_{t-1}).$$

The reward of an agent, $R(A)$, is defined as the reward of its game.
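For readers who prefer code, here is a minimal rendering of these definitions in Python. The alphabet, utility values, and discount factor are illustrative choices; the framework itself does not fix them.

```python
# Minimal rendering of the framework. The alphabet, utilities and discount
# factor are illustrative choices, not fixed by the framework.
from typing import Callable, Sequence

Action = str
History = tuple                      # a history is a tuple of actions
TI = Callable[[History], Action]     # a temporal instance maps a history to an action

UTILITY = {"rest": 0.0, "work": 1.0}     # u : Sigma -> R
GAMMA = 0.9                              # discount factor

def reward(game: Sequence[Action]) -> float:
    """Discounted reward R(g) of a (finite prefix of a) game."""
    return sum(GAMMA ** t * UTILITY[a] for t, a in enumerate(game))

def play(agent: Sequence[TI]) -> list:
    """Generate the game associated with an agent: g_t = A_t(g_0, ..., g_{t-1})."""
    game: list = []
    for ti in agent:
        game.append(ti(tuple(game)))
    return game

# A 10-step agent, all of whose temporal instances simply work:
diligent = [lambda h: "work"] * 10
print(play(diligent))
print(reward(play(diligent)))   # sum of 0.9**t for t in range(10)
```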
This is not yet a very interesting definition. So far, we have identified an agent as a function of her history, without giving her structure through a relationship to that history or to her future selves. This relationship comes from beliefs that TIs have about their successors.

Let $A$ be an agent with an associated utility function $u$. The beliefs of $A$ are a family of probability distributions $(b_{t,s})_{s > t}$, where each $b_{t,s}$ is a distribution over $\mathcal{F}_s$, the set of possible TIs at time $s$. Qualitatively, each distribution $b_{t,s}$ indicates the beliefs of the TI at time $t$ about who its successors at time $s$ are likely to be. We let $B_t = (b_{t,s})_{s > t}$ be the set of beliefs assigned to $A_t$. A set of beliefs $B_t$ induces a probability distribution over the set of games given a history $h_t$ and an action $a$ at time $t$. Since every game can be assigned a reward, our distribution over games induces a random variable $R_{B_t}(h_t, a)$.

Next, we need a way to formally ask whether a TI is playing the cooperative game “optimally” given its beliefs about the future. We define the counterfactual agents of $A$ at $t$ as the set of agents that are obtained by replacing $A_t$ by any other possible temporal instance, without changing any other instance. We denote this set as $\mathrm{CF}_t(A)$, which is indexed by the set of functions $\mathcal{F}_t = \{f : \Sigma^t \to \Sigma\}$. A given counterfactual agent is denoted $A[t \mapsto f]$.

Now we can define a notion of locally optimal behaviour, which serves as a version of Nash equilibria for the intrapersonal game. Let $t \in \mathbb{N}$ and let $h_t$ be the history generated recursively up to time $t$. A temporal instance $A_t$ is a reward maximiser in $A$ with respect to beliefs $B_t$ if

$$\mathbb{E}\left[R_{B_t}(h_t, A_t(h_t))\right] \;\geq\; \mathbb{E}\left[R_{B_t}(h_t, f(h_t))\right] \quad \text{for all } f \in \mathcal{F}_t,$$

i.e. no counterfactual replacement of $A_t$ does better in expectation under $A_t$’s own beliefs. If all TIs of $A$ are reward maximisers with respect to their own beliefs $B_t$, we say $A$ is locally rational (with respect to its beliefs $B$).

Without communication, trust, or shared identity, all that any given TI can do is play its best move based on how it thinks its successors will react. As in standard game theory, some Nash equilibria can be highly inadequate, as illustrated in the following example. Consider an agent whose TIs each play a modest action at every step, and who each believe that playing a more tempting, higher-utility action would cause all of their successors to play a worthless “punishment” action forever. For a sufficiently high discount factor, it can be checked that such an agent is locally rational when those beliefs are correct, i.e. each TI is confidently correct about its successors.
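Here is one way this example can be instantiated in code, continuing the sketch above. The action names, utilities, discount factor, and horizon are assumptions chosen only so that conforming beats deviating under the punishment beliefs.

```python
# One possible instantiation of the locally rational but self-punishing agent.
# The action names, utilities, discount factor and horizon are assumptions chosen
# so that conforming beats deviating when deviation triggers eternal punishment.
GAMMA = 0.9
UTILITY = {"safe": 1.0, "bold": 2.0, "punish": 0.0}
HORIZON = 50

def reward(game):
    return sum(GAMMA ** t * UTILITY[a] for t, a in enumerate(game))

def play(agent):
    game = []
    for ti in agent:
        game.append(ti(tuple(game)))
    return game

def paranoid_ti(history):
    # Conform unless anyone has ever been bold; punish forever afterwards.
    return "punish" if "bold" in history else "safe"

paranoid = [paranoid_ti] * HORIZON

# For the first TI (the same calculation works at any on-path step), its belief that
# a bold move triggers eternal punishment makes "safe" a best response:
value_if_safe = sum(GAMMA ** t * UTILITY["safe"] for t in range(HORIZON))   # ~ 10
value_if_bold = UTILITY["bold"]                                             # 2, then nothing
print(value_if_safe > value_if_bold)          # True: conformity is locally rational

# ...yet the agent as a whole leaves reward on the table:
always_bold = [lambda h: "bold"] * HORIZON
print(reward(play(paranoid)), "<", reward(play(always_bold)))
```

Note that the punishment clause is never exercised on the realised history; it does its work entirely through the beliefs, which foreshadows the credibility problem discussed below.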
So far, we have described TIs without making the usual assumption that they form a coherent cross-temporal agent. In this setting, we have shown that even when TIs have an incentive to cooperate with each other through the shared reward function, they may fail to do so due to a lack of “self-trust”. This corresponds to the earlier example of Dean. The next step is to ask what properties an agent that does coordinate with herself (one we can in good faith call an agent) would have.

Further work and conjectures

The locally rational agent I gave earlier could be seen as identifying as paranoid. She doesn’t trust herself, and so she keeps a strict regimen of conformity to make sure she gets any reward at all. It’s a suboptimal paradigm that fails to maximise reward, but it is consistent according to something that could be called an identity. These kinds of decisive, salient self-models that draw clear Schelling fences over behaviours constitute candidate notions of coherence. Thus, the shape of a possible theorem is that for some classes of locally rational agent-belief structures, there exists a shared “identity” that the TIs can be seen as respecting; in so doing, they lose the need to calculate based on their beliefs about each other, and can instead trust the common Schelling fence.

A trivial version of identity in the above framework would be an intention to play a periodic sequence, not dissimilar to how humans can organise their lives around habit and routine (see the sketch below). However, you can probably get non-trivial identities in a more expressive model. For example, it might be useful to make the TIs, their choices, or both live on metric spaces, such that notions of closeness or similarity are easier to define.
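As a toy illustration of that trivial case, an identity can be nothing more than a shared periodic routine that every TI consults, with no best-response computation against beliefs about successors. The routine itself is an arbitrary choice.

```python
# Toy illustration of the trivial "identity": a shared periodic routine that every
# temporal instance simply consults, instead of computing best responses against
# beliefs about its successors. The routine itself is an arbitrary choice.
ROUTINE = ["work", "work", "rest"]

def identity_ti(history):
    # Defer to the shared routine at this time step.
    return ROUTINE[len(history) % len(ROUTINE)]

agent = [identity_ti] * 9
game = []
for ti in agent:
    game.append(ti(tuple(game)))
print(game)   # ['work', 'work', 'rest', 'work', 'work', 'rest', 'work', 'work', 'rest']
```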
There are two other key improvements that should be made to the framework to make any results more satisfying. These are:

1. A more complete formalisation of beliefs.
2. A theory of robust equilibria and credible commitments.

Better notions of belief

The model I presented earlier conceptualises the beliefs TIs have as being about the behavioural footprint of other players. This is incomplete, because predictions about your future instances’ behaviour are only downstream of models you have about their beliefs and of Common Knowledge of Rationality (CKR). Game theorists recognised this as early as 1967, with Harsanyi writing about how proper conceptions of rational behaviour depend on recursive models that players have of each other. My formalism instead skips to predictions about the outputs of the agents, which leaves an important step implicit. Luckily, economists have already developed a theory of “hierarchical beliefs” for games[8]. Their results can’t be directly copied onto my intrapersonal game, because the recursion only goes forward in time. However, my case is strictly simpler, so it shouldn’t be hard to define a time-version of hierarchical beliefs based on previous work.

Robust equilibria and updatelessness

Nash equilibria are insufficient as solution concepts in extensive-form games, as is the corresponding idea of “local rationality” that I presented. First, there can be multiple Nash equilibria, in which case we need a heuristic to predict how players will choose one, or how they will converge to it in an iterated game. Second, some Nash equilibria don’t justify their continued stability once one player deviates from the norm. In the formal example from earlier, suppose one of the TIs selects the tempting action and calls its successors’ bluff. In that case, will the next TIs really follow up on their threat of eternal punishment? In game theory, this problem is referred to as the players’ threats not being credible. The concept of Subgame Perfect Equilibria (SPE) accounts for this by requiring that the strategies of each player form a Nash equilibrium in every subgame, no matter how players behaved up to that point.

SPEs are an attempt to define a class of equilibria that are robust, but they are flawed from a decision-theoretic perspective. They implicitly assume either that agents omnisciently lock in their entire policy profile at the beginning of time (as in UDT 1.1), or that they make threats but re-evaluate once their bluffs are called. SPEs don’t make sense in the first case because, if players can lock in policies, they might as well make threats[9]. The second interpretation ignores that players in real life actually do engage in some updateless punishing behaviour. Virtues such as vengefulness, out-group rejection, and punitiveness can be consequentially justified, but they are often embedded in social structures – such as prison or education systems – to extents beyond those a naive utilitarian argument would recommend.

This calls for a more nuanced definition of equilibria that accepts some self-destructive patterns. A fruitful theory of identity could address this, giving a correspondence between the agent’s self-model and the threats it can credibly make. For instance, the formal example I gave of suboptimal local rationality would only be “credible” if the agent’s “identity” were one of ultimate distrust of her future selves, such that she is knowingly willing to threaten eternal punishment to keep herself in line.

Such a theory would intuitively constitute a generalisation of UDT 1.0. In that framework, counterfactual or temporally varying versions of agents cooperate acausally by finding a common ancestor they can coordinate around. The common ancestor’s preferences and design can be thought of as nothing more than the identity that brings the different instances together. Just as in our nascent theory of identity, updatelessness has the major game-theoretical advantage of allowing credible threats[10].
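Continuing the hypothetical payoffs from the earlier sketch, the non-credibility of the threat is easy to check numerically: in the subgame that follows a deviation, carrying out the punishment is strictly worse for the punishing TI than any alternative, so local rationality (like Nash equilibrium) is silent about exactly the off-path behaviour that holds the equilibrium together.

```python
# Continuing the hypothetical payoffs from the earlier sketch: in the subgame after a
# deviation, the punishing TI is no longer best-responding, so the threat of eternal
# punishment is not credible in the subgame-perfection sense.
GAMMA = 0.9
UTILITY = {"safe": 1.0, "bold": 2.0, "punish": 0.0}

off_path = ("safe", "safe", "bold")   # some earlier TI was bold and called the bluff
t = len(off_path)

# The TI at time t believes its successors will punish forever regardless of what it
# does now, so only its own immediate utility differs between its options:
follow_through = GAMMA ** t * UTILITY["punish"]   # carry out the threat
relent         = GAMMA ** t * UTILITY["safe"]     # quietly go back to the modest action
be_bold        = GAMMA ** t * UTILITY["bold"]     # grab the tempting action itself

print(follow_through < relent < be_bold)   # True: punishing is strictly dominated off-path
```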
What does this have to do with AI?

The intuitive examples I gave are all based on human behaviour. Moreover, the agent I described formally above doesn’t just fail to act optimally; she is obtusely dumb. It might therefore be natural to ask what relevance this thread has to (superintelligent) AI. Understanding the future development of AI involves having a clear picture of which self-models embedded AIs are likely to adopt. In questioning and deconstructing the concept of the “self”, I seek insights about which self-models are likely to emerge from an agent’s past selection pressures, environment, and architecture.

My interest in studying the “self” through humans is two-fold. Firstly, understanding human agency is itself valuable for navigating human-AI cyborg futures (and presents!) and, more generally, for aligning AIs to us. Secondly, humans are one of the best examples of intelligent embedded agency we have available to us. I am cognisant that a focus on studying identity around human intuitions and examples risks cultivating anthropocentric models that don’t map onto AI. However, there are ways to address this possible failure mode. Firstly, I try to keep the conceptual vision as abstract as possible, such that it could work on arbitrary embedded agents subject to the passage of time. Secondly, I plan in the future to take significant inspiration and example from the growing field of LLM psychology and sociology, as LLMs are the closest non-animal thing we have to general intelligence.

Footnotes

[1] In the UDT 1.0 sense.

[2] For an overview of the development of the literature, see Chapter 2 of “The Bounds of Reason” by Herbert Gintis.

[3] Frederick et al. (2002) give an overview of experimental results related to dynamic inconsistency.

[4] To make the “iterated” game work, de Font-Reaulx assumes that the “day” and “night” agents remain the same on each iteration of the game, even though each iteration is meant to represent a different day.

[5] I recognise I’m making a tenuous connection between reasoning not being done “consciously” and it not being done at all. Many things we would call cognition are not processed consciously.

[6] UDT 1.1 is philosophically equivalent to dynamic consistency from the classic economics literature, whereas UDT 1.0 represents a weaker, more realistic version of it.

[7] In so doing, agents “cut through” the recursion of beliefs in games after two layers. This justifies why a shared identity is likely computationally cheaper than game-theoretic modelling.

[8] Samet and Heifetz (1998) is the mathematisation of CKR that I’m most familiar with.

[9] Related concept: commitment races as S-risks.

[10] I’m not advocating that UDT 1.0 has some normative claim over other decision theories. I’m rather interested in what amounts of updatelessness are advantageous to develop in agents undergoing a selection process such as evolution or RL.
