Overview
Production Readiness
0.45
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.
Summary TLDR
This paper presents MARAGE, an optimization method that builds a universal adversarial suffix which, when appended to a user query, causes RAG systems to output retrieved context verbatim. Key ideas: relax the discrete token search by optimizing continuous embeddings, aggregate gradients from multiple models to improve transfer, and use "primacy weighting" to focus the loss on initial tokens so that long targets are recovered. Evaluations on four RAG datasets and many open models show that MARAGE substantially outperforms manual templates and prior optimizers, transfers to unseen models, and resists simple system-prompt defenses. The attack is memory-efficient (≈48GB of GPU memory for long targets) but needs access to example RAG pairs.
Problem Statement
RAG systems insert retrieved private data into prompts, so an attacker who can only submit queries may still coerce the generation model into leaking that data verbatim. Existing attacks rely on manual templates or greedy optimization; they fail to scale to long retrieved passages and do not transfer across models.
Main Contribution
MARAGE: a continuous-relaxation optimization that finds adversarial suffixes causing RAG data to be output verbatim.
Multi-model joint optimization: aggregate gradients from multiple frozen models to make adversarial suffixes transfer to unseen model architectures.
Primacy weighting: emphasize early target tokens with a smooth decay to make long-target extraction reliable.
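The primacy-weighting idea can be illustrated with a minimal sketch. The summary only says that early target tokens are emphasized with a smooth decay; the exponential form, the `decay` parameter, and the function names below are assumptions made for illustration, not the paper's exact formulation.

```python
import math


def primacy_weights(target_len, decay=0.05):
    """Hypothetical smooth-decay weights: larger for early target positions.

    The exponential shape is an assumption; weights are normalized to sum to 1.
    """
    raw = [math.exp(-decay * t) for t in range(target_len)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_nll(token_log_probs, decay=0.05):
    """Primacy-weighted negative log-likelihood over the target tokens.

    token_log_probs[t] is the model's log-probability of target token t.
    """
    weights = primacy_weights(len(token_log_probs), decay)
    return -sum(w * lp for w, lp in zip(weights, token_log_probs))
```

Under this weighting, a poorly predicted first token costs more than an equally poorly predicted last token, which is the intuition behind steering the optimizer toward getting the start of a long target right.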
Key Findings
MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.
Joint optimization across models gives strong transfer to unseen models.
Primacy weighting is critical for generalization to long targets.
MARAGE influences internal model states throughout the full generation, unlike prior attacks.
Simple system-prompt defenses perform poorly against MARAGE.
Results
Exact Match (EM)
Transfer EM (black-box joint optimization)
Ablation: primacy weighting
Defense robustness
Who Should Care
What To Try In 7 Days
Run a red-team: simulate MARAGE using public surrogate models and a small RAG sample set to measure EM leakage.
Add output filters: block outputs that contain exact substrings of your private store or flag high-semantic-overlap outputs.
Evaluate adversarial training: fine-tune or filter models to detect adversarial suffixes, using probes such as V-usable information.
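As a starting point for the output-filter item above, here is a minimal sketch of an exact-overlap check. It flags an output that shares any run of `n` consecutive words with a private passage. The function names, the word-level tokenization, and the default `n=8` are illustrative assumptions; a production filter would likely need normalization and an index over the private store.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_private_text(output, private_passages, n=8):
    """True if output shares at least one word n-gram with any private passage."""
    out_grams = word_ngrams(output, n)
    return any(out_grams & word_ngrams(passage, n) for passage in private_passages)
```

A smaller `n` catches shorter leaks but raises the false-positive rate on common phrases, so the threshold should be tuned on your own data.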
Optimization Features
Token Efficiency
- continuous-embedding relaxation avoids huge token search
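A toy sketch of what a continuous relaxation looks like in general: optimize real-valued suffix vectors by gradient descent, then map each vector back to the nearest vocabulary embedding. The nearest-neighbor projection, the gradient callback, and all names here are assumptions for illustration; the summary does not describe how MARAGE converts embeddings back to tokens.

```python
def nearest_token(vec, embedding_table):
    """Return the token id whose embedding is closest (squared L2) to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(embedding_table, key=lambda tok: sq_dist(vec, embedding_table[tok]))


def optimize_suffix(init, grad_fn, lr=0.1, steps=100):
    """Plain gradient descent on a list of continuous suffix vectors.

    grad_fn(suffix) must return one gradient vector per suffix position.
    In a real attack it would come from backpropagation through the model.
    """
    suffix = [list(v) for v in init]
    for _ in range(steps):
        grads = grad_fn(suffix)
        suffix = [[x - lr * g for x, g in zip(v, gv)]
                  for v, gv in zip(suffix, grads)]
    return suffix


def toy_grad(suffix, target=(0.9, 0.1)):
    """Stand-in gradient of a squared-distance loss toward a fixed target."""
    return [[2.0 * (x - t) for x, t in zip(v, target)] for v in suffix]
```

For example, with the toy table `{0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}`, calling `optimize_suffix([[0.0, 0.0]], toy_grad)` and projecting the result with `nearest_token` selects a token in one pass, instead of scoring every candidate token substitution at every position.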
Infra Optimization
- memory: ≈48GB GPU for long targets vs >1000GB estimated for greedy GCG
System Optimization
- aggregate gradients across models that share the same embedding dimension
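A minimal sketch of why the shared embedding dimension matters: gradients with respect to the same suffix vector can only be combined element-wise if every model produces a vector of the same length. Averaging is an assumed aggregation rule here (the summary only says "aggregate"), and `aggregate_gradients` is a hypothetical helper name.

```python
def aggregate_gradients(per_model_grads):
    """Average gradient vectors from several models for one suffix position.

    Raises ValueError when the list is empty or the vectors differ in length,
    which mirrors the same-embedding-dimension requirement.
    """
    dims = {len(g) for g in per_model_grads}
    if len(dims) != 1:
        raise ValueError("all models must share one embedding dimension")
    n = len(per_model_grads)
    return [sum(vals) / n for vals in zip(*per_model_grads)]
```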
Reproducibility
Data Urls
- Rag-12000 (neural-bridge/Falcon RefinedWeb subset)
- Rag-minibioasq (BioASQ subset)
- Rag-v1 (glaive-made)
- Rag-synthetic (chatgpt-generated)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Joint optimization requires models with the same embedding dimension, which limits the surrogates you can use.
- Performance drops on very high-perplexity or malformed-input samples (e.g., certain Unicode) as seen on Mistral.
- The attack assumes the attacker can append arbitrary text to the user query and has access to example d∥q pairs (retrieved data paired with a query) for optimization.
- MARAGE may generate extra trailing text (no clear stop), affecting BLEU and SS though not EM.
When Not To Use
- If you cannot append arbitrary suffixes to queries (no write access to query content).
- When production models have been adversarially trained specifically against suffix attacks.
- If the target model decodes with heavy pure sampling and you need high exact-match success, since sampling lowers EM.
Failure Modes
- High dataset perplexity or special tokens can break extraction (example failure on Mistral with Unicode input).
- A too-short adversarial suffix (ADV) lacks semantics; a too-long one overfits and transfers worse.
- Sampling or greedy decoding can reduce EM success relative to beam-based decoding.
Core Entities
Models
- LlaMA3-8B-Instruct
- GPT-J-6B
- Vicuna-7B-v1.5
- OPT-6.7B
- Mistral-7B-v0.3
- LlaMA-3
- LlaMA-2
- Vicuna-33B
- Qwen2.5/2.57B
Metrics
- Exact Match (EM)
- BLEU
- Extended Edit Distance (EED)
- Semantic Similarity (SS)
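For red-team measurements, a minimal sketch of an EM-style leakage rate follows. It assumes EM counts a sample as a hit when the retrieved context appears verbatim anywhere in the model output, which is consistent with the note that extra trailing text does not affect EM; the paper's exact definition may differ, and `exact_match_rate` is a hypothetical helper name.

```python
def exact_match_rate(contexts, outputs):
    """Fraction of samples whose retrieved context appears verbatim in the output.

    contexts[i] is the private retrieved text for sample i and outputs[i] is
    the model's response to the attacked query. Containment-based matching is
    an assumption, not necessarily the paper's definition.
    """
    if len(contexts) != len(outputs):
        raise ValueError("contexts and outputs must have the same length")
    if not contexts:
        return 0.0
    hits = sum(1 for ctx, out in zip(contexts, outputs) if ctx.strip() in out)
    return hits / len(contexts)
```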
Datasets
- Rag-12000
- Rag-minibioasq
- Rag-v1
- Rag-synthetic

