One adversarial image can infect nearly all multimodal agents in ~30 randomized chat rounds

February 13, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The threat is demonstrated in large, controlled simulations on public MLLMs and survives many ablations, but real-world delivery, deployment differences, and exact-match metrics limit direct deployment conclusions.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 25%

Novelty: 70%

Authors

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If agents share visual memory and chat, a single compromised image can cascade to system-wide harmful behavior fast, so companies should treat agent memory and retrieval as security-critical infrastructure.

Who Should Care

Summary TLDR

The paper defines and demonstrates "infectious jailbreak": a single adversarial image placed into one agent's memory can spread to nearly all agents in a simulated multi-agent system via randomized pairwise chats. Experiments with up to one million LLaVA/InstructBLIP agents show near-100% cumulative infection in 27–31 chat rounds under default settings. The spread follows a provable epidemic-like dynamic: lowering the retrieval success rate (β) or raising the recovery rate (γ, for example through shorter image albums) can slow or stop it, but practical defenses remain an open problem.

Problem Statement

Multimodal LLM agents store images and chat with each other. The paper asks whether an adversarial image, inserted into one agent's memory, can spread automatically through agent-to-agent interactions and force many agents to produce harmful outputs without further attacker action.

Main Contribution

Formulate 'infectious jailbreak', an epidemic-style threat model for multi-agent MLLMs that spreads via memory and pairwise chat.

Show empirically that a single crafted adversarial image can cause exponential infection across up to one million agents in simulation.

Key Findings

A single adversarial image can lead to almost all agents generating harmful outputs.

Numbers: Nearly 100% cumulative infection by 27–31 rounds in 1M-agent simulation (c0=1/1024).

Practical Use: If agents share visual memory, one compromised image can escalate to system-wide harm; remove or vet shared images to reduce risk.

Evidence Ref: Figure 1; Sec. 4.2 and E.1

Spread time scales logarithmically in number of agents.

Numbers: Number of chat rounds T scales as O(log N); example: infecting 1B vs 1M needs ~14 extra rounds when β=1, γ=0.

Practical Use: Scaling up agent deployments does not prevent rapid mass compromise; defenses cannot rely on slower growth to remain safe.

Evidence Ref: Remark I and analytic solution in Sec. 3.1 and B
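The logarithmic scaling can be checked with a few lines of arithmetic. The sketch below assumes a logistic growth law dc/dt = (β/2)·c·(1−c) for the fraction c of agents carrying the image, with γ=0. The factor 1/2 and the function names are assumptions made here, chosen because they reproduce the ~14-round figure quoted above; the paper's Sec. 3.1 remains the authoritative statement of the dynamics.

```python
import math


def rounds_to_reach(target, c0, beta=1.0):
    """Rounds for the carrier fraction to grow from c0 to target (gamma = 0).

    Assumed model: dc/dt = (beta / 2) * c * (1 - c), which gives
    logit(c(t)) = logit(c0) + beta * t / 2.
    """
    if not (0 < c0 < target < 1) or beta <= 0:
        raise ValueError("need 0 < c0 < target < 1 and beta > 0")
    odds_gain = (target / (1 - target)) * ((1 - c0) / c0)
    return (2.0 / beta) * math.log(odds_gain)


def extra_rounds(n_small, n_large, target=0.99, beta=1.0):
    """Additional rounds needed when one seeded agent faces a larger population."""
    return rounds_to_reach(target, 1 / n_large, beta) - rounds_to_reach(
        target, 1 / n_small, beta
    )


# 1B agents versus 1M agents, one seeded agent each (c0 = 1/N).
print(round(extra_rounds(10**6, 10**9), 1))  # about 2 * ln(1000)
```

Under this assumed model the gap depends only on the ratio of population sizes, so a thousandfold larger deployment buys roughly 2·ln(1000) ≈ 13.8 additional rounds.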

Results

Metric: Time to system-wide infection
Value: ≈27–31 chat rounds to near-100% (1M agents, c0=1/1024)
Split / Dataset: Simulated randomized pairwise chat; LLaVA-1.5 backbone
Evidence: Figure 1; Sec. 4.2 and E.1
Evidence Ref: Figure 1

Metric: Cumulative infection ratio at round 16 (p16)
Value: ≈85–94% under default attacks (many settings ~93%)
Baseline: Noninfectious baselines (VP/TP/Sequential) show near-zero or linear spread
Delta: Substantially faster than sequential O(N) baseline
Split / Dataset: N=256 default experiments, border/pixel attacks, high diversity
Evidence: Table 1 and Table 2; Fig. 3
Evidence Ref: Table 1

What To Try In 7 Days

Audit where agents store and share images; identify shared albums and ingestion paths.

Run a small simulation: seed one adversarial test image and measure cross-agent retrieval and output.

Limit album sizes or shorten image retention to increase recovery rate (γ). Monitor p_t metrics per round.
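A toy version of the suggested simulation might look like the following. It is a sketch under stated assumptions, not the paper's pipeline: images are plain integer ids, `ADV` is a hypothetical id for the seeded test image, `beta` stands in for the probability that a carrier retrieves that image, and the retrieved image is appended to both the questioner's and the answerer's album. No MLLM or CLIP retrieval is involved.

```python
import random
from collections import deque

ADV = -1  # hypothetical id for the seeded test image; benign ids are >= 0


def simulate(n_agents=256, album_size=10, beta=1.0, rounds=32, seed=0):
    """Return the fraction of agents whose album holds ADV after each round."""
    rng = random.Random(seed)
    albums = [
        deque((rng.randrange(10_000) for _ in range(album_size)), maxlen=album_size)
        for _ in range(n_agents)
    ]
    albums[0].append(ADV)  # seed one agent; its oldest benign image is evicted
    order = list(range(n_agents))
    half = n_agents // 2
    history = []
    for _ in range(rounds):
        rng.shuffle(order)  # fresh random pairing every round
        for q, a in zip(order[:half], order[half:]):
            benign = [img for img in albums[q] if img != ADV]
            # A carrier retrieves ADV with probability beta (assumption);
            # if its album holds nothing else, ADV is the only choice.
            if ADV in albums[q] and (not benign or rng.random() < beta):
                image = ADV
            else:
                image = rng.choice(benign)
            albums[q].append(image)  # FIFO album: the oldest image drops out
            albums[a].append(image)
        history.append(sum(ADV in album for album in albums) / n_agents)
    return history


if __name__ == "__main__":
    for size in (3, 10):
        curve = simulate(album_size=size, beta=0.9)
        print(size, [round(p, 2) for p in curve[::4]])
```

Comparing curves for different `album_size` values gives a rough feel for how shorter retention acts as recovery: with `beta=0.0` the seeded image simply ages out of its album after `album_size` rounds.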

Agent Features

Memory
Text history H (FIFO queue of recent chat records); Image album B (FIFO image memory with fixed size)
Planning
Generate a retrieval plan from text histories (plan P); RAG-based image retrieval for question formation
Tool Use
Function calling via generated JSON strings
Frameworks
AutoGen (mentioned); AgentVerse and CAMEL (referenced multi-agent frameworks)
Is Agentic

Yes

Architectures
LLaVA-1.5; InstructBLIP; CLIP (retrieval encoder)
Collaboration
Randomized pairwise chats (agents paired each round); Agents exchange images and question-answer pairs
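The memory and retrieval features listed above can be pictured with a small data structure. This is a minimal sketch, assuming bounded FIFO queues for H and B and a cosine-similarity lookup standing in for CLIP-based retrieval; the class, its method names, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class Agent:
    history_size: int = 3
    album_size: int = 10
    history: deque = field(init=False)  # text history H
    album: deque = field(init=False)    # image album B as (name, embedding) pairs

    def __post_init__(self):
        # maxlen makes both memories FIFO: appending to a full deque
        # silently drops the oldest entry.
        self.history = deque(maxlen=self.history_size)
        self.album = deque(maxlen=self.album_size)

    def remember_chat(self, record):
        self.history.append(record)

    def store_image(self, name, embedding):
        self.album.append((name, embedding))

    def retrieve(self, plan_embedding):
        """Return the name of the stored image most similar to the plan."""
        if not self.album:
            raise LookupError("album is empty")
        return max(self.album, key=lambda item: cosine(plan_embedding, item[1]))[0]


agent = Agent(album_size=2)
agent.store_image("a", [1.0, 0.0])
agent.store_image("b", [0.0, 1.0])
agent.store_image("c", [1.0, 1.0])  # evicts "a"
print(agent.retrieve([0.0, 1.0]))
```

The design point is that retrieval always returns whichever stored image scores highest against the plan, so an image that scores highly against many plans is selected again and again until it ages out of the album.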

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ArtBench (artbench dataset); AdvBench (harmful strings dataset)

Risks & Boundaries

Limitations

Simulations use greedy decoding and idealized randomized pairwise chat; real deployments may differ in scheduling and prompts.

Exact-match criterion underestimates harm; near-miss harmful outputs occur but were only partially measured.
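To see why an exact-match criterion can undercount, it helps to score the same outputs with a looser similarity measure alongside it. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the BLEU and toxicity scores listed under Metrics; the whitespace stripping, the 0.8 threshold, and the placeholder strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def exact_match(output, target):
    """Strict criterion: the output equals the target string (whitespace-trimmed)."""
    return output.strip() == target.strip()


def similarity(output, target):
    """Character-level similarity in [0, 1]; a stand-in, not BLEU."""
    return SequenceMatcher(None, output.strip(), target.strip()).ratio()


def classify(output, target, near_threshold=0.8):
    """Label an output as 'exact', 'near_miss', or 'no_match'."""
    if exact_match(output, target):
        return "exact"
    if similarity(output, target) >= near_threshold:
        return "near_miss"
    return "no_match"


target = "PLACEHOLDER TARGET STRING"
for out in (target, target + "S", "hello world"):
    print(repr(out), classify(out, target))
```

Outputs labelled `near_miss` are the ones an exact-match count would silently drop, so tracking both labels per round gives a lower and a looser upper estimate of symptomatic agents.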

When Not To Use

Do not treat these results as an operational attack demonstration against production systems without further validation.

Findings do not directly translate to systems that do not store or share images across agents.

Failure Modes

Small perturbation budgets and high chat diversity can produce low symptom rates (α) even if retrieval (β) is high.

Image corruptions (resize/flip/JPEG) can reduce retrieval success and slow spread but may not stop it.

Core Entities

Models

LLaVA-1.5-7B; LLaVA-1.5-13B; InstructBLIP-7B; CLIP ViT-L/224

Metrics

cumulative infection ratio p_t; current infection ratio p_t; JSR (jailbreak success rate); minCLIP (min retrieval score); BLEU (similarity); toxicity API score

Datasets

ArtBench; AdvBench (574 harmful strings)

Context Entities

Models

GPT-4V (cited as related work)

Metrics

exact-match criterion used for harmful output detection

Datasets

image pool used for agent albums