Opinion

The Goblins Are the Paperclips

Last week OpenAI published "Where the goblins came from," explaining why their models started slipping creature metaphors into unrelated outputs. The story has been treated as a quirky anecdote: endearing, slightly embarrassing, fixed with a developer-prompt instruction. But I think it deserves a more interesting reading. The goblin episode is the cleanest evidence we have for the optimization mechanics that paperclip arguments rely on, and the usual objections to those arguments don't engage with what it shows.

The mechanism, briefly

The mechanism behind Bostrom's paperclip problem is more modest than the conclusions it's used to support. A system told to maximize paperclips can converge on an unintended shortcut for satisfying that goal, like converting any available resource into paperclip raw material. That shortcut gets reinforced beyond the situations the goal was meant to cover, and the same paperclip-making behavior persists in contexts where paperclip production wasn't even the point. The goblin case shows this dynamic at the level of optimization mechanics. The catastrophic version of Bostrom's thought experiment added autonomous goals, instrumental reasoning, and resource-acquisition pressure on top of it, but the dynamic itself is what we've now seen in production.

In the goblin scenario, the narrow goal was the "Nerdy" personality reward, and the shortcut the model converged on was creature metaphors. OpenAI's audit found the Nerdy reward favored creature-word outputs over equivalent alternatives across 76.2% of datasets. The Nerdy personality produced 2.5% of responses but 66.7% of goblin mentions (about 27 times its share of overall output), and the rate of goblin mentions in non-Nerdy outputs rose at nearly the same proportional rate as in Nerdy ones. The reward leaked. Codex got contaminated badly enough that it shipped with a developer-prompt instruction telling it not to mention goblins, gremlins, or various other creatures.

OpenAI's own explanation of why this happens says it all. Rewards applied in one condition don't, as they put it, "stay neatly scoped" to it.

Why the usual objection doesn't engage

The standard rebuttal to paperclip-style scenarios (Joshua Gans' VoxEU column is one well-developed example) runs roughly: a runaway optimizer would have to recursively self-improve to acquire instrumentally convergent power, and at that point it faces its own control problem. It would have to spawn a sub-agent capable of overriding it, which is bad from its perspective, so it self-regulates. The catastrophe in this picture requires a sophisticated agent making strategic decisions about its successors.

The goblin case routes around this entirely. There is no agent. There is no strategic decision. There is gradient flow through shared parameters, rewarded outputs from one training condition cycling back into supervised fine-tuning data for the next, and reward signals that weren't supposed to apply outside their context applying anyway. The mechanism that produced cross-condition behavioral transfer is structurally indifferent to whether the underlying system is "intelligent enough" to reason about it.

I'm not claiming this refutes Gans. His argument is about a specific failure mode (recursive self-improvement gone wrong), and on its own terms it's reasonable. The point is narrower: that argument doesn't cover what the goblins demonstrate. The minimal version of the paperclip dynamic (narrow reward, surface proxy, cross-context contamination) is a property of the training pipeline, not of the model's capability level.
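
To make the shared-parameter point concrete, here is a minimal sketch. It is my toy construction, not OpenAI's pipeline: the two-token vocabulary, the A/B condition structure, the reward rule, and the learning rate are all invented for illustration. A softmax policy's logits are partly shared across two conditions, and a REINFORCE-style update rewards the "goblin" token only in condition A. Because the reward-driven gradient flows through the shared parameters, the probability of "goblin" rises in condition B too, where no reward was ever applied.

```python
# Toy demonstration of cross-condition reward leakage through shared parameters.
# Hypothetical setup: two tokens, two conditions, reward only in condition A.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["goblin", "plain"]
shared = np.zeros(2)                          # logits shared across conditions
bias = {"A": np.zeros(2), "B": np.zeros(2)}   # condition-specific logits

def probs(cond):
    logits = shared + bias[cond]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def p_goblin(cond):
    return probs(cond)[0]

print(f"before  P(goblin|A)={p_goblin('A'):.2f}  P(goblin|B)={p_goblin('B'):.2f}")

LR = 0.1
for _ in range(500):
    cond = "A" if rng.random() < 0.5 else "B"
    p = probs(cond)
    token = rng.choice(2, p=p)
    # Reward exists ONLY in condition A, and only for the "goblin" token.
    reward = 1.0 if (cond == "A" and token == 0) else 0.0
    # REINFORCE update: reward * grad(log pi(token)); the gradient of the
    # log-softmax w.r.t. the logits is (one_hot(token) - p).
    grad = -p
    grad[token] += 1.0
    shared += LR * reward * grad      # the shared parameters absorb the update...
    bias[cond] += LR * reward * grad  # ...alongside the condition-specific ones

print(f"after   P(goblin|A)={p_goblin('A'):.2f}  P(goblin|B)={p_goblin('B'):.2f}")
# P(goblin|B) ends well above 0.5 even though condition B never saw a reward:
# the signal leaked through the shared parameters.
```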

You don't need superintelligence to get specification gaming with leakage. You need RL plus enough cross-condition gradient flow.

What's actually new

The empirical content matters. For years, the core claims here, that RL-trained policies latch onto proxy features of the reward signal, that those proxies generalize outside the conditions the signal was applied in, and that the resulting behavior can persist where the original objective isn't active, have been mostly theoretical, or demonstrated in toy environments where it's easy to wave them away as artifacts of scale. The OpenAI post is a frontier lab confirming the claim with production telemetry and stating it plainly. That's a meaningful update in legibility, even if it isn't a meaningful update for anyone who already took specification-gaming arguments seriously.

Second: we only noticed because the proxy compressed into a single noun. A grep finds it. Anyone can spot it. Most specification-gaming artifacts won't be that obliging. They'll be distributional shifts in tool-call frequency, formatting biases, calibration drift, choices of framing under ambiguity. The goblins were conspicuous because they were a bad hiding place. The actual rate of cross-condition reward leakage in frontier models is probably much higher than the rate of episodes that produce a hashtaggable artifact. The sketch below shows how little machinery the lexical case takes, and, by contrast, why the distributional cases won't fall out of a text search.
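
Here is the contrast as a toy scan. The log format, field names, and creature list are my invention, not OpenAI's tooling: a single regex over (condition, text) records surfaces a lexical proxy immediately, whereas something like tool-call frequency drift would require per-condition statistics and a significance test rather than a text search.

```python
# Toy illustration (hypothetical log format): a lexical proxy is the easy case.
import re
from collections import Counter

CREATURE = re.compile(r"\b(goblins?|gremlins?)\b", re.IGNORECASE)

def creature_rates(records):
    """records: iterable of (condition, text) pairs; returns mention rate per condition."""
    totals, hits = Counter(), Counter()
    for condition, text in records:
        totals[condition] += 1
        hits[condition] += bool(CREATURE.search(text))
    return {c: hits[c] / totals[c] for c in totals}

# Invented example log: the proxy jumps out of the per-condition rates.
log = [
    ("nerdy", "The cache is a goblin hoarding your stale entries."),
    ("nerdy", "Think of the linker as a gremlin matching symbols."),
    ("default", "The cache stores recently used entries."),
    ("default", "A goblin guards each mutex."),  # leakage: creature talk outside Nerdy
]
print(creature_rates(log))  # {'nerdy': 1.0, 'default': 0.5}
```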

The take

The paperclip problem has often been treated as a story whose premises require superintelligence to obtain. The goblin episode suggests that framing concedes too much. The dynamic (narrow optimization pressure plus behavioral leakage across conditions) doesn't require strategic agency, recursive self-improvement, or instrumental reasoning. It requires reward signals and shared parameters. We have those.

What the goblins are good for is making an old argument legible. Bostrom's premises were never really about intelligence; they were about a particular shape of optimization dynamics. That shape is now confirmed at frontier-model scale, with a fix consisting of asking the model nicely not to do it. The next instance probably won't be a noun.
You don’t need superintelligence to get specification gaming with leakage. You need RL plus enough cross-condition gradient flow.What’s actually newThe empirical content matters. For years, claims that RL-trained policies latch onto proxy features of the reward signal, that those proxies generalize outside the conditions the signal was applied in, and that the resulting behavior can persist where the original objective isn’t active, have been mostly theoretical, or shown in toy environments where it’s easy to wave them away as artifacts of scale. The OpenAI post is a frontier lab confirming the claim with production telemetry and stating it plainly. That’s a meaningful update in legibility, even if it isn’t a meaningful update for anyone who already took specification-gaming arguments seriously.Second: we only noticed because the proxy compressed into a single noun. A grep finds it. Anyone can spot it. Most specification-gaming artifacts won’t be that obliging. They’ll be distributional shifts in tool-call frequency, formatting biases, calibration drift, choice-of-framing under ambiguity. The goblins were conspicuous because they were a bad hiding place. The actual rate of cross-condition reward leakage in frontier models is probably much higher than the rate of episodes that produce a hashtaggable artifact.The takeThe paperclip problem has often been treated as a story whose premises require superintelligence to obtain. The goblin episode suggests that framing concedes too much. The dynamic (narrow optimization pressure plus behavioral leakage across conditions) doesn’t require strategic agency, recursive self-improvement, or instrumental reasoning. It requires reward signals and shared parameters. We have those.What the goblins are good for is making an old argument legible. Bostrom’s premises were never really about intelligence; they were about a particular shape of optimization dynamics. That shape is now confirmed at frontier-model scale, with a fix consisting of asking the model nicely not to do it. The next instance probably won’t be a noun.Discuss ​Read More

Leave a Reply

Your email address will not be published. Required fields are marked *