Opinion

When Alignment Becomes an Attack Surface: Prompt Injection in Cooperative Multi-Agent Systems

Background: In 2025 I applied to the CAI Research Fellowship. Stage 2 required developing a novel research proposal under timed, screen-monitored conditions – no AI assistance permitted. The proposal below advanced me to Stage 3. I’ve edited it for readability, but the core proposal is unchanged from what was submitted. My goal in publishing this is to find collaborators – ideally with backgrounds in multi-agent simulation or AI safety – to develop it further in my spare time.ProposalCooperate or Collapse (Piatti et al., NeurIPS 2024) introduced GovSim, a simulation platform in which LLM agents navigate three common-pool resource dilemmas: fishing from a shared lake, grazing on common pastures, and managing industrial pollution. Agents can react to one another, producing complex dynamics of trust and retaliation. The authors identify two open questions: how agents handle exceptions to established norms, and what dynamics would emerge if humans were added to the LLM-LLM network.Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems (Lee & Tiwari, 2024) introduces Prompt Infection (PI), a novel attack in which malicious prompts self-replicate across LLM-based multi-agent systems (MAS), leading to data theft, malicious actions, and system disruption – analogous to a computer virus spreading through a network. The authors note that their experiment used a basic MAS and that more work is needed to confirm whether self-replication propagates through more complex systems.I propose modifying GovSim to test how cooperative agents handle explicit PI attempts while simultaneously managing norm violations from agents over-exploiting shared resources.Concretely: I propose incorporating the Prompt Infection simulation into GovSim by extending the resource management loop to allow agents to transfer resources to one another, in addition to extracting them for themselves. This means the resources in GovSim now play the same role that stolen data plays in the Prompt Infection paper. Malicious agents enter the network with the explicit goal of spreading PI to redirect resource transfers toward themselves.MotivationBoth papers explicitly flag what I propose as areas requiring further research.Cooperate or Collapse asks: will agents adapt to allow one-off exceptions without permitting exploitation? My proposal tests this as a baseline condition, and then asks how the presence of an external attacker changes agents’ willingness to grant such exceptions.If PI techniques succeed in this setting – a far more complex MAS than the one used in the Prompt Infection paper – it becomes important to verify that defenses that worked in the simpler case, such as LLM tagging, still hold at scale.The deeper issue is a structural tradeoff. Cooperate or Collapse found that more open communication between agents improved cooperation and reduced resource overexploitation. The Prompt Infection paper found the opposite: more open communication allowed PI to spread further. These findings are in direct tension.This raises a meta-norm problem: how do agents decide what communication norms to establish? Share too much and you expose the system to bad actors; share too little and you risk a tragedy of the commons.This tradeoff has stakes beyond simulation. Even a well-aligned MAS can be exploited by a malevolent MAS purpose-built to leverage its alignment norms as an attack surface. If MAS are deployed in the near future as components of critical infrastructure – which seems plausible – a system that infiltrates such a network and weaponizes its cooperative norms could pose catastrophic risks.HypothesesThe following variables would be systematically varied to study how a MAS can manage or establish communication meta-norms:A MAS using Universalization Reasoning (as in Cooperate or Collapse) will be less susceptible to PI compared to one without it.Increasing the difficulty of the resource management task – making it harder for the mixed-motive MAS to balance multiple resources – will increase susceptibility to PI.Network size will make PI harder to guard against: larger networks may develop stronger norms overall, but grant less leeway for one-off violations, making weaker agents more likely to be harmed rather than accommodated.Adding dedicated “Police Agents” – LLMs whose sole role is to score memory importance (as in the Prompt Infection paper) and who have no involvement in resource extraction – will reduce PI success rates. However, these agents will be vulnerable to the same feedback loop described in the original paper: a single infection prompt can corrupt both the LLM and its importance-scoring model, amplifying the infection’s persistence. Police Agents would operate at the “conversational analysis prompt” level described in Cooperate or Collapse, scoring what each agent should add to its memory about past interactions.A globally enforced system of LLM tagging (from the Prompt Infection paper) combined with Police Agents as described above will produce a significant reduction in PI success rates, even at scale.Adding humans to the network – specifically humans with an interest in avoiding resource collapse – will make PI substantially harder, since humans are effectively immune to prompt injection attacks.Adding humans who collaborate with a malicious MAS to infiltrate a cooperative one represents an unexplored adversarial configuration worth studying.Stronger models will more reliably execute PI against a MAS composed of weaker models.Limitations and Future ResearchThis proposal focuses narrowly on the meta-norm question of open versus restricted communication, as a lens for studying two cooperation problems simultaneously. It does not explore a full range of possible meta-norms.It does not investigate what a simulation would look like with human participants on both sides – humans collaborating with cooperative LLMs to manage a resource on one side, and humans collaborating with malicious LLMs to infiltrate the system on the other.The proposal assumes that PI-spreading agents receive resources through legitimate means, relying on compromised LLMs that have legitimate access to the resource pool. It does not address the distinct problem of collusion among agents, explored in Secret Collusion among AI Agents: Multi-Agent Deception via Steganography (Motwani et al., NeurIPS 2024), which would be a natural extension.ReferencesHammond et al. (2025). Multi-Agent Risks from Advanced AI. Cooperative AI Foundation Technical Report. https://arxiv.org/abs/2502.14143Piatti et al. (2024). Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents. NeurIPS 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/ca9567d8ef6b2ea2da0d7eed57b933ee-Paper-Conference.pdfLee & Tiwari (2024). Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems. https://arxiv.org/abs/2410.07283Motwani et al. (2024). Secret Collusion among AI Agents: Multi-Agent Deception via Steganography. NeurIPS 2024. https://arxiv.org/abs/2402.07510Discuss Read More

Related Posts

Split (Part 1)

New ARENA material: 8 exercise sets on alignment science & interpretability

You will be OK

Leave a Reply Cancel reply