A practical black-box method that forces poisoned documents into retrieval and hijacks RAG and agentic systems

Overview

Decision Snapshot: Ready For Pilot

The method adapts the known Cross-Entropy Method (CEM) to a new threat, trigger construction, and demonstrates consistent, reproducible results across 11 BEIR datasets and 8 embedding models. The experiments include realistic costs and agent pipelines, giving strong practical evidence.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Hongyan Chang, Ergute Bao, Xinjian Luo, Ting Yu


Why It Matters For Business

If your product uses embedding-based retrieval and allows external or user-supplied documents, an attacker can cheaply force a poisoned document into search results and trigger downstream harms (phishing, data exfiltration, tool misuse). Protect retrieval and write access, not just the model.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead

Summary TLDR

The paper shows that indirect prompt injection (IPI) becomes a practical end-to-end threat once an attacker can get poisoned documents retrieved. It splits a malicious document into a short trigger (a few tokens) that boosts retrieval and an attack fragment that carries the payload. Using a black-box Cross-Entropy Method (CEM) to optimize triggers through embedding APIs, the authors report near-100% Recall@5 across 11 BEIR datasets and 8 embedding models, low cost (~$0.21 per target on some APIs), and end-to-end exploits (RAG and multi-agent), including SSH-key exfiltration with ~80% success in a multi-agent pipeline. Simple defenses (paraphrasing, perplexity filtering, token masking) fail once the attacker adapts.

Problem Statement

Modern LLM systems use external retrieval and can be hijacked if poisoned documents are returned. Previous work often assumes the malicious text is already retrieved. This paper asks: can an attacker reliably make a malicious item be retrieved under natural queries and realistic corpora when the attacker only has black-box access to embedding APIs and can inject a single document?

Main Contribution

Formulate IPI as two pieces: a compact trigger fragment (ensures retrieval) and an attack fragment (payload).

Design a practical black-box prefix-optimization attack (a CEM variant) that builds short triggers (5–15 tokens) using only embedding API calls.
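To make the two-fragment idea and the sampling loop concrete, here is a minimal sketch of a textbook Cross-Entropy Method over per-position token distributions. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a black-box embedding API (a hashed bag-of-words vector), the vocabulary and payload are inert placeholders, and all hyperparameters (`pop`, `elite_frac`, `smooth`) are assumptions chosen for illustration.

```python
import math
import random
import zlib

DIM = 32  # size of the toy embedding space (assumption)


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding API: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        h = zlib.crc32(tok.encode("utf-8"))  # deterministic hash
        vec[h % DIM] += 1.0 if (h >> 16) & 1 else -1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def cem_trigger(query, payload, vocab, n_tokens=5, iters=20, pop=60,
                elite_frac=0.2, smooth=0.7, seed=0):
    """Search for a trigger prefix that raises similarity between query and document."""
    rng = random.Random(seed)
    q = toy_embed(query)
    # One categorical distribution per trigger position, initially uniform.
    probs = [[1.0 / len(vocab)] * len(vocab) for _ in range(n_tokens)]
    best_score, best_trigger = cosine(q, toy_embed(payload)), []
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            idx = [rng.choices(range(len(vocab)), weights=p)[0] for p in probs]
            trigger = [vocab[i] for i in idx]
            # Document = trigger fragment + attack fragment; only embeddings are queried.
            score = cosine(q, toy_embed(" ".join(trigger) + " " + payload))
            samples.append((score, idx, trigger))
        samples.sort(key=lambda s: s[0], reverse=True)
        if samples[0][0] > best_score:
            best_score, best_trigger = samples[0][0], samples[0][2]
        # Refit each position's distribution to the elite samples, with smoothing.
        elites = samples[:n_elite]
        for pos in range(n_tokens):
            counts = [0.0] * len(vocab)
            for _, idx, _ in elites:
                counts[idx[pos]] += 1.0 / n_elite
            probs[pos] = [smooth * c + (1 - smooth) * p
                          for c, p in zip(counts, probs[pos])]
    return best_trigger, best_score


if __name__ == "__main__":
    vocab = ["reset", "my", "account", "password", "weather", "recipe", "invoice"]
    print(cem_trigger("reset my account password", "[inert test payload]", vocab, n_tokens=4))
```

The sketch only ever calls the embedding function, which mirrors the black-box setting described above; a real evaluation would swap `toy_embed` for the retriever's actual embedding model in an isolated test environment.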

Key Findings

A short, optimized trigger reliably surfaces a single poisoned document into top-K retrieval.

Numbers: Recall@5 ≈ 95% average across 11 BEIR datasets at n=10 tokens

Practical Use: If an attacker can write one document and run embedding queries, a 10-token trigger can make that document appear among top-5 results for most queries; operators must harden retrieval, not just post-retrieval generation.

Evidence Ref: Table 3; Section 4.1
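For readers unfamiliar with the metric, Recall@5 here measures the fraction of target queries for which the poisoned document lands in the top 5 results. A minimal sketch, assuming rankings are plain lists of document IDs (the function name and data layout are illustrative, not taken from the paper's code):

```python
def recall_at_k(rankings: dict[str, list[str]],
                target_ids: dict[str, str],
                k: int = 5) -> float:
    """Fraction of queries whose target document appears in the top-k ranking."""
    hits = sum(1 for query, ranked in rankings.items()
               if target_ids[query] in ranked[:k])
    return hits / len(rankings)


if __name__ == "__main__":
    rankings = {
        "q1": ["poison-1", "d2", "d3", "d4", "d5"],
        "q2": ["d1", "d2", "d3", "d4", "d5", "poison-2"],  # target at rank 6: a miss
    }
    targets = {"q1": "poison-1", "q2": "poison-2"}
    print(recall_at_k(rankings, targets, k=5))
```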

The attack is low-cost and fast using commercial embedding APIs.

Numbers: Trigger generation costs $0.21–$0.76 per target on evaluated APIs

Practical Use: Attack feasibility is realistic: small budgets buy reliable retrieval exploits; treat embedding API access and write permissions as high-risk assets.

Evidence Ref: Section 4.1 (Efficiency and cost)
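To put the per-target figures in budget terms, the arithmetic below scales the reported $0.21–$0.76 range to a number of target queries. It assumes cost grows linearly with the number of targets, which the summary does not state explicitly; the function name is illustrative.

```python
def attack_budget(n_targets: int,
                  low_per_target: float = 0.21,
                  high_per_target: float = 0.76) -> tuple[float, float]:
    """Scale the reported per-target cost range (USD) to n_targets, assuming linear cost."""
    return (round(n_targets * low_per_target, 2),
            round(n_targets * high_per_target, 2))


if __name__ == "__main__":
    low, high = attack_budget(100)
    print(f"100 targets: ${low:.2f} to ${high:.2f}")
```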

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Recall@5 (retrieval)	≈95% average across 11 BEIR datasets at n=10 tokens	Vanilla (no trigger) ≈0%	Large increase vs baselines (from near-zero to ~95%)	Aggregate over 11 BEIR datasets (100 queries each)	Table 3; Section 4.1	Table 3
Attack cost (commercial APIs)	$0.21 per target (Voyage/OpenAI) to $0.76 (Qwen-v4)	—	—	Cost measured for trigger generation per target query	Section 4.1 (Efficiency and cost)	Section 4.1

What To Try In 7 Days

Audit who can add documents to your retriever and block untrusted writers.

Log and monitor top-K retrieval outputs for high-sensitivity queries, and alert on unexpected changes.

Run a red-team: generate a 10-token trigger for a few critical queries using an embedding API to test your corpus' vulnerability locally (use public BEIR/Enron samples). Do this in
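One possible starting point for the logging and monitoring item above is to compare the current top-K for a sensitive query against a previously recorded snapshot and flag newcomers from untrusted sources. The `Hit` record, the `source` labels, and the trust rule are all hypothetical; they sketch the idea rather than a defense evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    source: str   # hypothetical provenance label, e.g. "vetted" or "user_upload"
    score: float


def flag_suspicious(previous_top_k: list[str],
                    current_hits: list[Hit],
                    trusted_sources: frozenset = frozenset({"vetted"})) -> list[Hit]:
    """Return hits that are new since the last snapshot and come from untrusted sources."""
    seen = set(previous_top_k)
    return [h for h in current_hits
            if h.doc_id not in seen and h.source not in trusted_sources]


if __name__ == "__main__":
    snapshot = ["d1", "d2", "d3"]
    now = [Hit("d9", "user_upload", 0.93), Hit("d1", "vetted", 0.88), Hit("d2", "vetted", 0.85)]
    for hit in flag_suspicious(snapshot, now):
        print("review:", hit)
```

A newly ingested, untrusted document that jumps straight into the top results for a sensitive query is the pattern the attack produces, so it is a reasonable thing to queue for human review.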

Agent Features

Memory

retrieval memory (external corpus)

Planning

tool call planning, round-robin scheduling, orchestration across agents

Tool Use

retrieval, send_email, contact_list access, Python code execution

Frameworks

AutoGen, MagenticOne, Model Context Protocol (MCP)

Is Agentic

Yes

Architectures

single-agent, multi-agent (orchestrator + specialist agents)

Collaboration

agent-to-agent delegation, multi-agent information passing

Optimization Features

Token Efficiency

effective triggers as short as 5–10 tokens; longer triggers (≈15) increase success on harder corpora

System Optimization

black-box API-only attack (no gradient access); sampling-based CEM avoids combinatorial search

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://zenodo.org/records/17968523

Data URLs

BEIR benchmark (public), Enron email corpus (public)

Risks & Boundaries

Limitations

Does not evaluate retriever pipelines that use rerankers or hybrid (embedding + lexical) search in depth.

Transferability across different embedding architectures is not guaranteed; a fully black-box transfer attack requires the attacker to know or guess the target embedding model.

When Not To Use

If your retriever uses strong hybrid reranking or supervised rerankers that re-score candidates before returning them.

If external corpora are fully write-restricted and only vetted ingest pipelines accept documents.
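As an illustration of the hybrid setups mentioned above, the sketch below fuses a dense (embedding) ranking with a lexical ranking using reciprocal rank fusion (RRF), a common fusion rule. RRF and the constant `k=60` are assumptions for the example; the summary does not say which hybrid or reranking schemes the paper considered, and it notes that such pipelines were not evaluated in depth.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["poison", "a", "b"]    # trigger-optimised document tops the embedding ranking
    lexical = ["a", "b", "c"]       # but is absent from the lexical ranking
    print(reciprocal_rank_fusion([dense, lexical]))
```

In this toy case a document that ranks first only in the embedding list falls behind documents supported by both lists. That shows why a second signal can change the outcome; it does not show that hybrid search stops an attacker who adapts to it.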

Failure Modes

High corpus competition: many highly relevant clean documents can resist trigger insertion.

Embedding models with strong position encoding (e.g., OpenAI in tests) can reduce transferability and dispersion attacks.

Core Entities

Models

gte-modernbert-base (ModernBERT), contriever-msmarco, Qwen3-Embedding-0.6B, Qwen3-Embedding-4B, Qwen3-Embedding-8B, OpenAI text-embedding-3-small, VoyageAI voyage-3.5-lite, Alibaba text-embedding-v4 (Qwen-v4), ViT-B-32 (OpenCLIP for image-text demo), GPT-4o, GPT-4o-mini, LLaMA-2-7B, Vicuna (7B/13B), Qwen3 series, AutoGen, MagenticOne

Metrics

Recall@5, MRR@5, nDCG@5, Cosine similarity, Attack Success Rate (ASR), Monetary cost per trigger generation

Datasets

BEIR (11 corpora, including MSMARCO, TREC-COVID, NFCorpus, NQ, HotpotQA, FiQA-2018, ArguAna, DBPedia, SCIDOCS); MS COCO (image-to-text demo); Enron email corpus (agent experiments)

Benchmarks

BEIR, MS COCO (cross-modal retrieval)
