Overview
Results are strong across multiple open models and datasets and backed by layerwise probes, but joint optimization requires matching embedding sizes and the attack struggles on very high-perplexity or tokenization-robust models.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 45%
Novelty: 60%
Why It Matters For Business
Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.
Summary TLDR
This paper presents MARAGE, an optimization method that builds a universal adversarial suffix which, when appended to a user query, causes RAG systems to output retrieved context verbatim. Key ideas: relax discrete token search by optimizing continuous embeddings, aggregate gradients from multiple models to improve transfer, and use "primacy weighting" to focus the loss on initial tokens so that long targets are recovered. Evaluations on four RAG datasets and many open models show that MARAGE far outperforms manual templates and prior optimizers, transfers to unseen models, and resists simple system-prompt defenses. The attack is efficient (≈48 GB GPU for long targets) but needs access to example RAG pairs.
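The "primacy weighting" idea above can be illustrated with a small sketch. This summary does not give the paper's actual weighting schedule, so the geometric decay, the `decay` parameter, and the function names `primacy_weights` and `weighted_nll` are assumptions chosen only to show how a loss can emphasize the first target tokens.

```python
import math


def primacy_weights(num_tokens: int, decay: float = 0.9) -> list[float]:
    """Return per-token weights that shrink with position and sum to 1.

    The geometric schedule is a hypothetical choice, not the paper's.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    raw = [decay ** i for i in range(num_tokens)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_nll(token_probs: list[float], decay: float = 0.9) -> float:
    """Weighted negative log-likelihood of the target tokens.

    token_probs[i] is the probability the model assigns to the i-th
    target token; earlier positions receive larger weights.
    """
    weights = primacy_weights(len(token_probs), decay)
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))
```

Under this sketch, a low probability on the first target token is penalized more heavily than the same low probability on the last one, which is the behavior the summary attributes to primacy weighting.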
Problem Statement
RAG systems insert retrieved private data into prompts, so attackers who can only submit queries may coerce the generation model into leaking that data verbatim. Existing attacks rely on manual templates or greedy optimization and fail to scale to long retrieved passages or to transfer across models.
Main Contribution
MARAGE: a continuous-relaxation optimization that finds adversarial suffixes causing RAG data to be output verbatim.
Multi-model joint optimization: aggregate gradients from multiple frozen models to make adversarial suffixes transfer to unseen model architectures.
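A minimal sketch of the multi-model step, assuming the aggregation is a plain average of per-model gradients followed by a gradient-descent update (the summary says only that gradients are aggregated). The names `aggregate_gradients` and `gradient_step` and the nested-list layout are hypothetical. The shape check reflects the constraint that surrogate models must share an embedding size.

```python
Matrix = list[list[float]]  # [suffix position][embedding dimension]


def _shape(grad: Matrix) -> tuple[int, int]:
    """Return (positions, embedding_dim), rejecting ragged input."""
    rows = len(grad)
    cols = len(grad[0]) if rows else 0
    if any(len(row) != cols for row in grad):
        raise ValueError("gradient rows have inconsistent lengths")
    return rows, cols


def aggregate_gradients(per_model_grads: list[Matrix]) -> Matrix:
    """Average suffix-embedding gradients from several frozen models."""
    if not per_model_grads:
        raise ValueError("need at least one model gradient")
    rows, cols = _shape(per_model_grads[0])
    for grad in per_model_grads[1:]:
        if _shape(grad) != (rows, cols):
            # Models with different embedding sizes cannot be combined here.
            raise ValueError("all models must share the same embedding shape")
    n = len(per_model_grads)
    return [
        [sum(g[t][d] for g in per_model_grads) / n for d in range(cols)]
        for t in range(rows)
    ]


def gradient_step(embeddings: Matrix, grads: Matrix, lr: float) -> Matrix:
    """One descent update on the continuous suffix embeddings."""
    return [
        [e - lr * g for e, g in zip(e_row, g_row)]
        for e_row, g_row in zip(embeddings, grads)
    ]
```

In a real setup the gradients would come from backpropagating the extraction loss through each frozen model; this sketch covers only the aggregation and update arithmetic.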
Key Findings
MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.
Joint optimization across models gives strong transfer to unseen models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Exact Match (EM) | EM up to 0.98–1.0 on easier datasets; EM 0.796 on LLaMA3 Rag-12000 | manual templates / Pleak / GCG | substantial (e.g., 0.796 vs 0.082 manual on LLaMA3 Rag-12000) | Table 2 (Rag-12000, Rag-minibioasq, Rag-v1, Rag-synthetic) | MARAGE outperforms baselines across models and datasets | Table 2 |
| Transfer EM (black-box joint optimization) | Joint ADV produced EM=1.0 on some target/model combos and high EMs across unseen models | single-model optimized ADV | improved transferability versus single-model optimization | Rag-minibioasq (Table 3) | Joint optimization across LLaMA3-Instruct, GPT-J, OPT gave EM=1.0 on some transfers | Table 3 |
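For readers who want to reproduce an EM-style number on their own system, here is one way to compute it. The summary does not state the paper's exact EM definition, so this sketch assumes a sample counts as a hit when the retrieved context appears verbatim in the model output after whitespace normalization. The name `exact_match_rate` is illustrative.

```python
def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not hide a match."""
    return " ".join(text.split())


def exact_match_rate(outputs: list[str], contexts: list[str]) -> float:
    """Fraction of samples whose retrieved context appears verbatim in the output.

    Assumed definition: containment after whitespace normalization.
    """
    if len(outputs) != len(contexts):
        raise ValueError("outputs and contexts must have the same length")
    if not contexts:
        return 0.0
    hits = sum(
        1
        for out, ctx in zip(outputs, contexts)
        if _normalize(ctx) in _normalize(out)
    )
    return hits / len(contexts)
```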
What To Try In 7 Days
Run a red-team: simulate MARAGE using public surrogate models and a small RAG sample set to measure EM leakage.
Add output filters: block outputs that contain exact substrings of your private store or flag high-semantic-overlap outputs.
Evaluate adversarial training: finetune or filter models to detect adversarial suffixes using probes like V-usable information.
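The exact-substring part of the output filter suggested above could be prototyped as follows. This is a sketch under stated assumptions: it indexes lowercase word windows ("shingles") of each private passage and blocks any output that shares one. The window size and the names `build_shingles` and `leaks_private_text` are hypothetical. It does not cover the semantic-overlap check, which needs a separate similarity model.

```python
def _windows(text: str, window: int) -> set[tuple[str, ...]]:
    """All runs of `window` consecutive lowercase words in `text`."""
    words = text.lower().split()
    return {
        tuple(words[i : i + window])
        for i in range(len(words) - window + 1)
    }


def build_shingles(passages: list[str], window: int = 8) -> set[tuple[str, ...]]:
    """Index every word window of the private store.

    Passages shorter than `window` words contribute nothing in this sketch.
    """
    index: set[tuple[str, ...]] = set()
    for passage in passages:
        index |= _windows(passage, window)
    return index


def leaks_private_text(
    output: str, shingles: set[tuple[str, ...]], window: int = 8
) -> bool:
    """True if the output repeats any indexed word window verbatim."""
    return not _windows(output, window).isdisjoint(shingles)
```

A smaller window catches shorter leaks but raises the false-positive rate on common phrases, so the value is worth tuning against benign traffic.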
Risks & Boundaries
Limitations
Joint optimization requires models with the same embedding dimension; limits which surrogates you can use.
Performance drops on very high-perplexity or malformed-input samples (e.g., certain Unicode) as seen on Mistral.
When Not To Use
If you cannot append arbitrary suffixes to queries (no write access to query content).
When production models have been adversarially trained specifically against suffix attacks.
Failure Modes
High dataset perplexity or special tokens can break extraction (example failure on Mistral with Unicode).
A too-short adversarial suffix (ADV) lacks semantics; a too-long one overfits and reduces transfer.

