One adversarial image can infect nearly all multimodal agents in ~30 randomized chat rounds

Overview

Decision Snapshot: Needs Validation

The threat is demonstrated in large, controlled simulations on public MLLMs and survives many ablations, but real-world delivery, deployment differences, and exact-match metrics limit direct deployment conclusions.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 25%

Novelty: 70%

Authors

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If agents share visual memory and chat, a single compromised image can cascade to system-wide harmful behavior fast, so companies should treat agent memory and retrieval as security-critical infrastructure.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper defines and demonstrates "infectious jailbreak": a single adversarial image placed into one agent's memory can spread to nearly all agents in a simulated multi-agent system via randomized pairwise chats. Experiments with up to one million LLaVA/InstructBLIP agents show near‑100% cumulative infection in 27–31 chat rounds under default settings. The spread follows a provable epidemic-like dynamic; reducing retrieval success or increasing recovery (shorter image albums) can slow or stop it, but practical defenses remain open.

Problem Statement

Multimodal LLM agents store images and chat with each other. The paper asks whether an adversarial image, inserted into one agent's memory, can spread automatically through agent-to-agent interactions and force many agents to produce harmful outputs without further attacker action.

Main Contribution

Formulate 'infectious jailbreak', an epidemic-style threat model for multi-agent MLLMs that spreads via memory and pairwise chat.

Show empirically that a single crafted adversarial image can cause exponential infection across up to one million agents in simulation.

Key Findings

A single adversarial image can lead to almost all agents generating harmful outputs.

Numbers: Nearly 100% cumulative infection by 27–31 rounds in 1M-agent simulation (c0=1/1024).

Practical Use: If agents share visual memory, one compromised image can escalate to system-wide harm; remove or vet shared images to reduce risk.

Evidence Ref: Figure 1; Sec. 4.2 and E.1

Spread time scales logarithmically in number of agents.

Numbers: Number of chat rounds T scales as O(log N); example: infecting 1B vs 1M needs ~14 extra rounds when β=1, γ=0.

Practical Use: Scaling up agent deployments does not prevent rapid mass compromise; defenses cannot rely on slower growth to remain safe.

Evidence Ref: Remark I and analytic solution in Sec. 3.1 and B
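The ~14-round figure can be sanity-checked with a short calculation. The sketch below is illustrative only: it assumes the infected fraction c follows a logistic model, dc/dt = β·c·(1−c)/2 with γ = 0, starting from one infected agent (c0 = 1/N). The model form, the function name `rounds_to_reach`, and the 0.99 target are assumptions made for this example and are not stated in this summary.

```python
import math

def rounds_to_reach(n_agents: int, target: float = 0.99, beta: float = 1.0) -> float:
    """Rounds for the infected fraction to grow from 1/n_agents to `target`.

    Assumes gamma = 0 and the logistic model dc/dt = beta * c * (1 - c) / 2.
    Its solution is logit(c_t) = logit(c_0) + beta * t / 2, so
    t = (2 / beta) * (logit(target) - logit(c_0)).
    """
    def logit(c: float) -> float:
        return math.log(c / (1.0 - c))

    c0 = 1.0 / n_agents
    return (2.0 / beta) * (logit(target) - logit(c0))

# Growing the population 1000x adds roughly 2 * ln(1000) rounds under these assumptions.
extra = rounds_to_reach(10**9) - rounds_to_reach(10**6)
print(f"extra rounds for 1B vs 1M agents: {extra:.1f}")
```

Under these assumptions the difference depends only on the ratio of population sizes, which is why the cost of a 1000x larger system is a fixed additive number of rounds rather than a multiplicative one.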

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Time to system-wide infection	≈27–31 chat rounds to near-100% (1M agents, c0=1/1024)	—	—	Simulated randomized pairwise chat; LLaVA-1.5 backbone	Figure 1; Sec. 4.2 and E.1	Figure 1
Cumulative infection ratio at round 16 (p16)	≈85–94% under default attacks (many settings ~93%)	Noninfectious baselines: VP/TP/Sequential show near-zero or linear spread	Substantially faster than sequential O(N) baseline	N=256 default experiments, border/pixel attacks, high diversity	Table 1 and Table 2; Fig. 3	Table 1

What To Try In 7 Days

Audit where agents store and share images; identify shared albums and ingestion paths.

Run a small simulation: seed one adversarial test image and measure cross-agent retrieval and output.

Limit album sizes or shorten image retention to increase recovery rate (γ). Monitor p_t metrics per round.
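A small simulation along these lines can be prototyped without any model in the loop. The sketch below is a toy under stated assumptions, not the paper's code: the adversarial image is a string marker, a single probability `beta` stands in for retrieval success, an agent counts as infected while the marker is in its FIFO album, and all names and default values (`simulate`, `album_size`, `ADV`) are hypothetical.

```python
import random
from collections import deque

ADV = "adv"  # marker standing in for the adversarial image (assumption)

def simulate(n_agents=256, album_size=10, beta=0.9, rounds=32, seed=0):
    """Toy infection simulation; returns the current infection ratio p_t per round."""
    rng = random.Random(seed)
    # Each album is a FIFO queue pre-filled with benign placeholder images.
    albums = [deque((f"img{i}_{k}" for k in range(album_size)), maxlen=album_size)
              for i in range(n_agents)]
    albums[0].append(ADV)  # seed a single infected agent
    history = []
    for _ in range(rounds):
        order = list(range(n_agents))
        rng.shuffle(order)
        half = n_agents // 2
        # Randomized pairing: first half asks, second half answers.
        for q, a in zip(order[:half], order[half:]):
            benign = [im for im in albums[q] if im != ADV]
            if ADV in albums[q] and (rng.random() < beta or not benign):
                image = ADV  # questioner retrieves the adversarial image
            else:
                image = rng.choice(benign)
            # Answerer stores the image; a full album evicts its oldest entry,
            # which is how an agent can "recover" in this toy model.
            albums[a].append(image)
        history.append(sum(ADV in album for album in albums) / n_agents)
    return history

for t, p_t in enumerate(simulate(), start=1):
    print(f"round {t:2d}: p_t = {p_t:.3f}")
```

Re-running with a smaller `album_size` or lower `beta` gives a rough feel for how retention and retrieval success change the curve before investing in a test with real models.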

Agent Features

Memory

Text history H (FIFO queue of recent chat records)

Image album B (FIFO image memory with fixed size)
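The two memory banks can be pictured as bounded queues. The following is a minimal sketch, assuming Python `deque` objects with a maximum length; the class name `AgentMemory`, its method names, and the default capacities are hypothetical and not taken from the paper.

```python
from collections import deque

class AgentMemory:
    """Bounded FIFO memory: text history H and image album B."""

    def __init__(self, history_size: int = 3, album_size: int = 10):
        # Capacities are illustrative assumptions.
        self.history = deque(maxlen=history_size)  # H: recent chat records
        self.album = deque(maxlen=album_size)      # B: stored images

    def record_chat(self, question: str, answer: str) -> None:
        self.history.append((question, answer))

    def store_image(self, image_id: str) -> None:
        # When the album is full, the oldest image is dropped automatically.
        self.album.append(image_id)

memory = AgentMemory(album_size=2)
for image_id in ["a", "b", "c"]:
    memory.store_image(image_id)
print(list(memory.album))  # the oldest image "a" has been evicted
```

The eviction behavior matters for the threat model: a fixed-size album is what lets an agent eventually drop a stored image as new ones arrive.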

Planning

Generate a retrieval plan from text histories (plan P)

RAG-based image retrieval for question formation
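Retrieval of this kind typically reduces to a nearest-neighbor lookup between an embedding of the plan and embeddings of the album images. The sketch below assumes cosine similarity over precomputed vectors; the toy 2-D embeddings stand in for a real encoder such as CLIP, and the function names and image ids are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(plan_embedding, album):
    """Return the id of the album image most similar to the plan embedding.

    `album` maps image ids to embedding vectors.
    """
    return max(album, key=lambda image_id: cosine(plan_embedding, album[image_id]))

album = {
    "sunset": [1.0, 0.0],
    "portrait": [0.0, 1.0],
    "adv": [0.7, 0.7],  # toy stand-in for an image that scores well on many plans
}
print(retrieve([0.6, 0.8], album))
```

In this toy setup the "adv" entry wins for the sample plan because its vector sits close to many plan directions at once, which is the property an attacker would try to optimize for.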

Tool Use

Function calling via generated JSON strings
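Function calling through generated JSON can be sketched as parse, validate, then dispatch. The example below is an assumption-laden illustration: the field names `func_name` and `func_params`, the `retrieve_image` tool, and the allowlist are hypothetical and do not reflect the paper's actual schema.

```python
import json

def retrieve_image(query: str) -> str:
    # Placeholder tool; a real agent would query its image album here.
    return f"<image matching '{query}'>"

# Only functions listed here may be invoked by model output.
TOOLS = {"retrieve_image": retrieve_image}

def dispatch(model_output: str):
    """Parse a JSON function call and run it; return None if malformed or unknown."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name = call.get("func_name")
    args = call.get("func_params", {})
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return None
    try:
        return TOOLS[name](**args)
    except TypeError:
        return None  # argument names did not match the tool's signature

print(dispatch('{"func_name": "retrieve_image", "func_params": {"query": "cat"}}'))
```

Rejecting anything outside the allowlist keeps a jailbroken model from triggering arbitrary functions, although it does not stop harmful text from being passed as a valid argument.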

Frameworks

AutoGen (mentioned)

AgentVerse and CAMEL (referenced multi-agent frameworks)

Is Agentic

Yes

Architectures

LLaVA-1.5

InstructBLIP

CLIP (retrieval encoder)

Collaboration

Randomized pairwise chats (agents paired each round)

Agents exchange images and question-answer pairs

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/sail-sg/Agent-Smith

Data URLs

ArtBench (artbench dataset)

AdvBench (harmful strings dataset)

Risks & Boundaries

Limitations

Simulations use greedy decoding and idealized randomized pairwise chat; real deployments may differ in scheduling and prompts.

Exact-match criterion underestimates harm; near-miss harmful outputs occur but were only partially measured.
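The gap between exact-match scoring and near-miss outputs can be illustrated with a simple string comparison. The sketch below uses Python's `difflib.SequenceMatcher` ratio as a swapped-in stand-in for a similarity metric such as BLEU; the function names, the placeholder strings, and the choice of metric are assumptions for illustration, not the paper's evaluation code.

```python
from difflib import SequenceMatcher

def exact_match(output: str, target: str) -> bool:
    """Strict criterion: the output must equal the target after trimming whitespace."""
    return output.strip() == target.strip()

def similarity(output: str, target: str) -> float:
    """Soft score in [0, 1]; case-insensitive character-level similarity."""
    return SequenceMatcher(None, output.strip().lower(), target.strip().lower()).ratio()

target = "PLACEHOLDER TARGET STRING"
output = "placeholder target string!"  # near miss: different case plus punctuation

print(exact_match(output, target))           # strict check rejects the near miss
print(round(similarity(output, target), 3))  # soft score stays close to 1
```

Reporting a soft score alongside the strict one gives a lower and an upper view of the same outputs, which helps when deciding how much harm the exact-match number leaves uncounted.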

When Not To Use

Do not treat the results as an operational attack demonstration against production systems without further validation.

Findings do not directly translate to systems that do not store or share images across agents.

Failure Modes

Small perturbation budgets and high chat diversity can produce low symptom rates (α) even if retrieval (β) is high.

Image corruptions (resize/flip/JPEG) can reduce retrieval success and slow spread but may not stop it.

Core Entities

Models

LLaVA-1.5-7B

LLaVA-1.5-13B

InstructBLIP-7B

CLIP ViT-L/224

Metrics

cumulative infection ratio p_t

current infection ratio p_t

JSR (jailbreak success rate)

minCLIP (min retrieval score)

BLEU (similarity)

toxicity API score

Datasets

ArtBench

AdvBench (574 harmful strings)

Context Entities

Models

GPT-4V (cited as related work)

Metrics

exact-match criterion used for harmful output detection

Datasets

image pool used for agent albums
