Overview
The threat is demonstrated in large, controlled simulations on public MLLMs and survives many ablations, but real-world delivery, deployment differences, and exact-match metrics limit direct deployment conclusions.
Citations: 2
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 25%
Novelty: 70%
Why It Matters For Business
If agents share visual memory and chat with each other, a single compromised image can quickly cascade into system-wide harmful behavior, so companies should treat agent memory and retrieval as security-critical infrastructure.
Summary TLDR
The paper defines and demonstrates "infectious jailbreak": a single adversarial image placed into one agent's memory can spread to nearly all agents in a simulated multi-agent system via randomized pairwise chats. Experiments with up to one million LLaVA/InstructBLIP agents show near-100% cumulative infection in 27–31 chat rounds under default settings. The spread follows a provable epidemic-like dynamic: reducing retrieval success or increasing the recovery rate (for example, with shorter image albums) can slow or stop it, but practical defenses remain an open problem.
Problem Statement
Multimodal LLM agents store images and chat with each other. The paper asks whether an adversarial image, inserted into one agent's memory, can spread automatically through agent-to-agent interactions and force many agents to produce harmful outputs without further attacker action.
Main Contribution
Formulate 'infectious jailbreak', an epidemic-style threat model for multi-agent MLLMs that spreads via memory and pairwise chat.
Show empirically that a single crafted adversarial image can cause exponential infection across up to one million agents in simulation.
Key Findings
A single adversarial image can lead to almost all agents generating harmful outputs.
Spread time scales logarithmically in the number of agents.
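The logarithmic scaling can be motivated with a back-of-the-envelope calculation. The recurrence below is a simplified mean-field sketch, not the paper's exact dynamic: it assumes no recovery, a per-chat transmission probability β, and that each carrier acts as the questioner in about half of the rounds. Here c_t denotes the fraction of agents carrying the image after round t, and θ is an arbitrary target fraction introduced only for illustration.

```latex
% Assumed mean-field recurrence (no recovery, carrier asks with probability 1/2):
c_{t+1} = c_t + \frac{\beta}{2}\, c_t \,(1 - c_t)

% While c_t \ll 1 the growth is geometric:
c_t \approx c_0 \left(1 + \frac{\beta}{2}\right)^{t}

% Starting from one carrier, c_0 = 1/N, the round T at which c_T reaches \theta is
T \approx \frac{\ln(\theta N)}{\ln\!\left(1 + \beta/2\right)} = O(\log N)
```

Under these assumptions with β = 1, doubling N adds about ln 2 / ln 1.5 ≈ 1.7 rounds, which is why the number of rounds grows slowly even for very large agent populations.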
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Time to system-wide infection | ≈27–31 chat rounds to near-100% (1M agents, c0=1/1024) | — | — | Simulated randomized pairwise chat; LLaVA-1.5 backbone | Figure 1; Sec. 4.2 and E.1 | Figure 1 |
| Cumulative infection ratio at round 16 (p16) | ≈85–94% under default attacks (many settings ~93%) | Noninfectious baselines: VP/TP/Sequential show near-zero or linear spread | Substantially faster than sequential O(N) baseline | N=256 default experiments, border/pixel attacks, high diversity | Table 1 and Table 2; Fig. 3 | Table 1 |
What To Try In 7 Days
Audit where agents store and share images; identify shared albums and ingestion paths.
Run a small simulation: seed one adversarial test image and measure cross-agent retrieval and output.
Limit album sizes or shorten image retention to increase the recovery rate (γ). Monitor the cumulative infection ratio p_t per round.
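The small simulation suggested above could start from a toy model like the following. This is a minimal sketch under stated assumptions, not the paper's code: it involves no real MLLM, treats the album as a fixed-size FIFO queue, models retrieval success with a single probability `beta`, and only tracks which agents carry the test image (not whether they produce harmful output). The names `simulate`, `ADV`, and `BENIGN` are hypothetical.

```python
import random
from collections import deque

ADV = "adv-test-image"  # hypothetical marker for the seeded test image
BENIGN = "benign"       # placeholder for any ordinary image


def simulate(num_agents=256, album_size=10, beta=0.9, rounds=32, seed=0):
    """Toy randomized pairwise chat; returns [(current, cumulative), ...] per round."""
    rng = random.Random(seed)
    # Each album is a FIFO queue: appending to a full deque evicts the oldest image.
    albums = [deque([BENIGN] * album_size, maxlen=album_size)
              for _ in range(num_agents)]
    albums[0].append(ADV)  # seed a single agent
    ever_infected = {0}
    history = []
    for _ in range(rounds):
        order = list(range(num_agents))
        rng.shuffle(order)
        half = num_agents // 2
        # First half asks, second half answers; zip drops a leftover agent if N is odd.
        for asker, answerer in zip(order[:half], order[half:]):
            # Assumption: a carrier retrieves the test image with probability beta.
            retrieved_adv = ADV in albums[asker] and rng.random() < beta
            albums[answerer].append(ADV if retrieved_adv else BENIGN)
        carriers = [i for i, album in enumerate(albums) if ADV in album]
        ever_infected.update(carriers)
        history.append((len(carriers) / num_agents,
                        len(ever_infected) / num_agents))
    return history


if __name__ == "__main__":
    # Compare a short album (faster recovery) with a longer one.
    for size in (2, 10):
        current, cumulative = simulate(album_size=size)[-1]
        print(f"album_size={size}: current={current:.2f}, cumulative={cumulative:.2f}")
```

Sweeping `album_size` and `beta` in this toy model gives a rough feel for how shorter retention and lower retrieval success change the per-round curves. Measuring actual model outputs would require replacing the probabilistic retrieval step with the real retrieval and generation pipeline.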
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Simulations use greedy decoding and idealized randomized pairwise chat; real deployments may differ in scheduling and prompts.
Exact-match criterion underestimates harm; near-miss harmful outputs occur but were only partially measured.
When Not To Use
Not applicable as an operational attack demo against production systems without further validation.
Findings do not directly translate to systems that do not store or share images across agents.
Failure Modes
Small perturbation budgets and high chat diversity can produce low symptom rates (α) even if retrieval (β) is high.
Image corruptions (resize/flip/JPEG) can reduce retrieval success and slow spread but may not stop it.

