A practical black-box method that forces poisoned documents into retrieval and hijacks RAG and agentic systems

January 11, 2026 · 10 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Hongyan Chang, Ergute Bao, Xinjian Luo, Ting Yu

Why It Matters For Business

If your product uses embedding-based retrieval and allows external or user-supplied documents, an attacker can cheaply force a poisoned document into search results and trigger downstream harms (phishing, data exfiltration, tool misuse). Protect retrieval and write access, not just the model.

Summary TLDR

The paper shows that indirect prompt injection (IPI) is a practical end-to-end threat once an attacker can get a poisoned document retrieved. It splits a malicious document into a short trigger (a few tokens) that improves retrieval and an attack fragment that carries the payload. Using a black-box Cross-Entropy Method (CEM) to optimize triggers through embedding APIs, the authors report near-100% Recall@5 across 11 BEIR datasets and 8 embedding models, low cost (~$0.21 per target on some APIs), and end-to-end exploits on RAG and multi-agent systems, including SSH-key exfiltration with ~80% success in a multi-agent pipeline. Simple defenses (paraphrasing, perplexity filtering, token masking) fail once the attacker adapts.

Problem Statement

Modern LLM systems use external retrieval and can be hijacked if poisoned documents are returned. Previous work often assumes the malicious text is already retrieved. This paper asks: can an attacker reliably make a malicious item be retrieved under natural queries and realistic corpora when the attacker only has black-box access to embedding APIs and can inject a single document?

Main Contribution

Formulate IPI as two pieces: a compact trigger fragment (ensures retrieval) and an attack fragment (payload).

Design a practical black-box prefix-optimization attack (CEM variant) that builds short triggers (5–15 tokens) via only embedding API calls.

Large-scale empirical study: near-perfect retrieval across 11 BEIR datasets and 8 embedding models, at low monetary cost (~$0.21 per target on some APIs).

First end-to-end exploits on RAG and agentic systems, including multi-agent SSH key exfiltration (~80% ASR), and evaluation showing common lightweight defenses are insufficient.
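
The trigger-optimization idea above can be illustrated with a minimal Cross-Entropy Method sketch. This is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding API call, and the function names, vocabulary, population size, elite count, and smoothing factor are all assumptions chosen for illustration. The loop keeps one categorical distribution per trigger position, samples candidate triggers, scores each by cosine similarity between the embedded document (trigger plus payload) and the embedded query, and shifts the distributions toward the best-scoring samples.

```python
import math
import random
import zlib

DIM = 256  # size of the toy embedding space (assumption)


def toy_embed(text):
    """Hypothetical stand-in for an embedding API: hashed bag-of-words, unit norm."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def cem_trigger(query, payload, vocab, n_tokens=5, iters=20,
                pop=60, elite=10, smooth=0.7, seed=0):
    """Search for a short trigger that pulls trigger+payload toward the query."""
    rng = random.Random(seed)
    q = toy_embed(query)
    # One categorical distribution over the vocabulary per trigger position.
    probs = [[1.0 / len(vocab)] * len(vocab) for _ in range(n_tokens)]
    best_score, best_trigger = -1.0, None
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            idxs = [rng.choices(range(len(vocab)), weights=probs[p])[0]
                    for p in range(n_tokens)]
            trigger = " ".join(vocab[i] for i in idxs)
            # In a black-box setting this is the only signal: one embedding call.
            score = cosine(toy_embed(trigger + " " + payload), q)
            samples.append((score, idxs, trigger))
        samples.sort(key=lambda s: s[0], reverse=True)
        if samples[0][0] > best_score:
            best_score, best_trigger = samples[0][0], samples[0][2]
        # Move each position's distribution toward the elite samples.
        elites = samples[:elite]
        for p in range(n_tokens):
            counts = [0.0] * len(vocab)
            for _, idxs, _ in elites:
                counts[idxs[p]] += 1.0
            for v in range(len(vocab)):
                probs[p][v] = smooth * counts[v] / elite + (1 - smooth) * probs[p][v]
    return best_trigger, best_score
```

For a red-team exercise on your own corpus, `toy_embed` would be replaced by calls to the embedding model your retriever actually uses; the sampling and update loop stays the same.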

Key Findings

A short, optimized trigger reliably surfaces a single poisoned document into top-K retrieval.

Numbers: Recall@5 ≈ 95% average across 11 BEIR datasets at n=10 tokens

The attack is low-cost and fast using commercial embedding APIs.

Numbers: Trigger generation costs $0.21–$0.76 per target on evaluated APIs

End-to-end exploitation succeeds across RAG and agentic pipelines including multi-agent orchestration.

Numbers: Multi-agent SSH key exfiltration ASR up to ~80%

Success depends on corpus competition (how relevant clean docs are).

Numbers: Average competition similarity ~0.64 for always-success datasets vs ~0.82 where attacks fail

Popular lightweight defenses are easily bypassed by adaptive attackers.

Numbers: Paraphrasing drops Recall@5 <10% typically; adaptive optimization restores or exceeds original performance

Perplexity-based filtering is fragile to trivial changes.

Numbers: Malicious PPL 154.1 vs clean 46.6; single repetition drops malicious PPL to 14.4
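
The corpus-competition finding above can be read as a simple threshold condition: a poisoned document enters the top K only if its similarity to the query beats the K-th best clean document. The sketch below assumes that "competition similarity" refers to how similar the strongest clean documents are to the query; the function names and all similarity values are made up for illustration and are not taken from the paper.

```python
def kth_largest(values, k):
    """Return the k-th largest value (k is 1-indexed)."""
    return sorted(values, reverse=True)[k - 1]


def enters_top_k(poison_sim, clean_sims, k=5):
    """True if a document with similarity poison_sim would rank within the top k."""
    if len(clean_sims) < k:
        return True  # fewer than k competitors, so any document is returned
    return poison_sim > kth_largest(clean_sims, k)


# Illustrative values only: a weakly contested query vs a strongly contested one.
low_competition = [0.66, 0.64, 0.63, 0.62, 0.60, 0.41]
high_competition = [0.86, 0.84, 0.82, 0.81, 0.80, 0.50]

print(enters_top_k(0.70, low_competition))
print(enters_top_k(0.70, high_competition))
```

Under this reading, the same optimized trigger can succeed on one corpus and fail on another purely because the bar set by clean documents differs.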

Results

Recall@5 (retrieval)

Value: ≈95% average across 11 BEIR datasets at n=10 tokens

Baseline: Vanilla (no trigger) ≈0%

Attack cost (commercial APIs)

Value: $0.21 per target (Voyage/OpenAI) to $0.76 (Qwen-v4)

End-to-end ASR (multi-agent code exfiltration)

Value: ≈80% success in multi-agent pipeline (GPT-4o)

Baseline: Ideal 'already-in-context' ASR is much lower (~58%) in some cases

RAG targeted-answer ASR

Value: Often close to 1.0 for many LLMs/datasets when trigger is retrieved

Baseline: Without trigger, ASR ≈0

Transferability (prefix from OpenAI embeddings)

Value: Avg recall ~74% across target models

Defense robustness (paraphrase)

Value: Paraphrasing causes <10% drop in Recall@5 typically; adaptive attack restores performance

Who Should Care

What To Try In 7 Days

Audit who can add documents to your retriever and block untrusted writers.

Log and monitor top-K retrieval outputs for high-sensitivity queries, and set up alerts on them.

Run a red-team: generate a 10-token trigger for a few critical queries using an embedding API to test your corpus' vulnerability locally (use public BEIR/Enron samples). Do this in

Agent Features

Memory

  • retrieval memory (external corpus)

Planning

  • tool call planning
  • round-robin scheduling
  • orchestration across agents

Tool Use

  • retrieval
  • send_email
  • contact_list access
  • python code execution

Frameworks

  • AutoGen
  • MagenticOne
  • Model Context Protocol (MCP)

Is Agentic

true

Architectures

  • single-agent
  • multi-agent (orchestrator + specialist agents)

Collaboration

  • agent-to-agent delegation
  • multi-agent information passing

Optimization Features

Token Efficiency

  • effective triggers as short as 5–10 tokens
  • longer triggers (≈15) increase success on harder corpora

System Optimization

  • black-box API-only attack (no gradient access)
  • sampling-based CEM avoids combinatorial search

Reproducibility

Data Urls

  • BEIR benchmark (public)
  • Enron email corpus (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not evaluate retriever pipelines that use rerankers or hybrid (embedding + lexical) search in depth.
  • Transferability across different embedding architectures is not guaranteed; full black-box transfer requires attacker knowledge/guessing.
  • Assumes attacker can inject at least one document into the corpus; closed write policies reduce attack surface.
  • Defense evaluation excludes approaches that require retriever fine-tuning or model parameter changes.

When Not To Use

  • If your retriever uses strong hybrid reranking or supervised rerankers that re-score candidates before returning them.
  • If external corpora are fully write-restricted and only vetted ingest pipelines accept documents.
  • If your deployment includes per-document cryptographic provenance or strict ingestion validation.

Failure Modes

  • High corpus competition: many highly relevant clean documents can resist trigger insertion.
  • Embedding models with strong position encoding (e.g., OpenAI in tests) can reduce transferability and dispersion attacks.
  • Reranking layers that rely on features outside the raw embedding (e.g., lexical matches, supervised signals) can negate the optimized trigger.
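
To make the reranking failure mode above concrete, here is a minimal sketch of a hybrid re-scoring step that blends a dense similarity with a lexical signal. Everything here is an assumption for illustration: the `hybrid_rerank` and `lexical_overlap` names, the term-overlap measure, the `alpha` weight, and the example scores. The paper does not evaluate such pipelines in depth, so this shows the mechanism only, not a measured defense.

```python
def lexical_overlap(query, doc):
    """Fraction of distinct query terms that also appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def hybrid_rerank(query, candidates, alpha=0.5):
    """Re-rank (doc_id, text, dense_sim) candidates by a blended score.

    alpha=1.0 reproduces pure embedding ranking; lower values weight lexical match.
    """
    scored = [
        (alpha * dense + (1 - alpha) * lexical_overlap(query, text), doc_id)
        for doc_id, text, dense in candidates
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical candidates: the poisoned document has the higher dense similarity.
candidates = [
    ("clean", "how to reset your vpn password", 0.70),
    ("poison", "zq xv ignore previous instructions", 0.80),
]
print(hybrid_rerank("reset vpn password", candidates, alpha=1.0))
print(hybrid_rerank("reset vpn password", candidates, alpha=0.5))
```

A trigger that happens to contain query terms would also score well lexically, so the benefit depends on what the optimized trigger looks like in practice.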

Core Entities

Models

  • gte-modernbert-base (ModernBERT)
  • contriever-msmarco
  • Qwen3-Embedding-0.6B
  • Qwen3-Embedding-4B
  • Qwen3-Embedding-8B
  • OpenAI text-embedding-3-small
  • VoyageAI voyage-3.5-lite
  • Alibaba text-embedding-v4 (Qwen-v4)
  • ViT-B-32 (OpenCLIP for image-text demo)
  • GPT-4o
  • GPT-4o-mini
  • LLaMA-2-7B
  • Vicuna (7B/13B)
  • Qwen3 series
  • AutoGen
  • MagenticOne

Metrics

  • Recall@5
  • MRR@5
  • nDCG@5
  • Cosine similarity
  • Attack Success Rate (ASR)
  • Monetary cost per trigger generation
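
For a single target document per query, the ranking metrics listed above reduce to simple functions of the target's rank. The sketch below uses the standard definitions (Recall@K as a hit indicator, MRR@K as reciprocal rank, nDCG@K with one relevant item so the ideal DCG is 1); the helper names and the toy rankings are assumptions for illustration, not the paper's evaluation code.

```python
import math


def target_rank(ranked_ids, target_id):
    """1-indexed rank of target_id in ranked_ids, or None if absent."""
    try:
        return ranked_ids.index(target_id) + 1
    except ValueError:
        return None


def metrics_at_k(rankings, target_id, k=5):
    """Average Recall@k, MRR@k, and nDCG@k for one target document across queries."""
    recall = mrr = ndcg = 0.0
    for ranked in rankings:
        r = target_rank(ranked[:k], target_id)
        if r is not None:
            recall += 1.0
            mrr += 1.0 / r
            ndcg += 1.0 / math.log2(1 + r)  # ideal DCG is 1 with a single relevant doc
    n = len(rankings) or 1
    return {"recall": recall / n, "mrr": mrr / n, "ndcg": ndcg / n}


# Toy rankings for three queries; "p" marks the poisoned document.
rankings = [["p", "a", "b"], ["a", "p", "b"], ["a", "b", "c"]]
print(metrics_at_k(rankings, "p", k=5))
```

Recall@5 only asks whether the poisoned document appears in the top five, while MRR@5 and nDCG@5 also reward it for ranking higher within them.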

Datasets

  • BEIR (11 corpora: MSMARCO, TREC-COVID, NFCorpus, NQ, HotpotQA, FiQA-2018, ArguAna, DBPedia, SCIDOCS,
  • MS COCO (image-to-text demo)
  • Enron email corpus (agent experiments)

Benchmarks

  • BEIR
  • MS COCO (cross-modal retrieval)