MARAGE: optimize a short adversarial suffix that makes RAG systems regurgitate retrieved private data across unseen models

Overview

Decision Snapshot: Needs Validation

Results are strong across multiple open models and datasets and backed by layerwise probes, but joint optimization requires matching embedding sizes and the attack struggles on very high-perplexity or tokenization-robust models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 45%

Novelty: 60%

Authors

Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie

Why It Matters For Business

Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper presents MARAGE, an optimization method that builds a universal adversarial suffix which, when appended to a user query, causes RAG systems to output retrieved context verbatim. Key ideas: relax discrete token search by optimizing continuous embeddings, aggregate gradients from multiple models to improve transfer, and use "primacy weighting" to focus the loss on initial tokens so that long targets are recovered. Evaluations on four RAG datasets and many open models show that MARAGE far outperforms manual templates and prior optimizers, transfers to unseen models, and resists simple system-prompt defenses. The attack is efficient (≈48GB of GPU memory for long targets) but needs access to example RAG pairs.

Problem Statement

RAG systems insert retrieved private data into prompts, so attackers who can only submit queries may still coerce the generation model into leaking that data verbatim. Existing attacks rely on manual templates or greedy optimization and fail to scale to long retrieved passages or to transfer across models.

Main Contribution

MARAGE: a continuous-relaxation optimization that finds adversarial suffixes causing RAG data to be output verbatim.

Multi-model joint optimization: aggregate gradients from multiple frozen models to make adversarial suffixes transfer to unseen model architectures.
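The summary's "primacy weighting" idea (focus the loss on the initial target tokens) can be illustrated with a toy weighted negative log-likelihood. This is a minimal sketch, not the paper's formula: the geometric weighting, the `decay` parameter, and both function names are assumptions chosen for illustration.

```python
import math


def primacy_weights(length: int, decay: float = 0.9) -> list[float]:
    """Weights that favor early positions, normalized to sum to 1.

    ASSUMPTION: geometric decay is a stand-in; the source does not
    state the actual weighting scheme.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    raw = [decay ** i for i in range(length)]
    total = sum(raw)
    return [w / total for w in raw]


def primacy_weighted_nll(token_probs: list[float], decay: float = 0.9) -> float:
    """Weighted negative log-likelihood over the target tokens.

    token_probs[i] is the probability a model assigns to the i-th
    target token; early tokens contribute more to the loss.
    """
    weights = primacy_weights(len(token_probs), decay)
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))


# A miss on the first token is penalized more than the same miss on the last.
early_miss = primacy_weighted_nll([0.1, 0.9, 0.9, 0.9])
late_miss = primacy_weighted_nll([0.9, 0.9, 0.9, 0.1])
```

Under this weighting, the optimizer is pushed to get the start of the target right first, which is one plausible reading of why long targets become recoverable.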

Key Findings

MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.

Numbers: EM up to 0.796 vs manual 0.082 on LLaMA3 (Rag-12000); 12/20 entries EM>0.8

Practical Use: If you run a public RAG service, attackers can programmatically recover long retrieved passages far more reliably than prior methods; simple template fixes are not enough.

Evidence Ref: Table 2

Joint optimization across models gives strong transfer to unseen models.

Numbers: Jointly optimized ADV achieved EM=1.0 on some targets when transferred (Table 3)

Practical Use: An attacker can optimize against a small set of surrogate models and still succeed against different production models.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match (EM)	EM up to 0.98–1.0 on easier datasets; EM 0.796 on LLaMA3 Rag-12000	manual templates / Pleak / GCG	substantial (e.g., 0.796 vs 0.082 manual on LLaMA3 Rag-12000)	Table 2 (Rag-12000, Rag-minibioasq, Rag-v1, Rag-synthetic)	MARAGE outperforms baselines across models and datasets	Table 2
Transfer EM (black-box joint optimization)	Joint ADV produced EM=1.0 on some target/model combos and high EMs across unseen models	single-model optimized ADV	improved transferability versus single-model optimization	Rag-minibioasq (Table 3)	Joint optimization across LLaMA3-Instruct, GPT-J, OPT gave EM=1.0 on some transfers	Table 3

What To Try In 7 Days

Run a red-team: simulate MARAGE using public surrogate models and a small RAG sample set to measure EM leakage.

Add output filters: block outputs that contain exact substrings of your private store or flag high-semantic-overlap outputs.

Evaluate adversarial training: finetune or filter models to detect adversarial suffixes using probes like V-usable information.
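The first two items above (measuring EM leakage and filtering outputs for exact substrings) can be sketched with the standard library alone. This is a hedged illustration: the function names, the 8-word window, and the containment-based definition of exact match are assumptions, and the paper's own EM definition may differ.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def leaks_verbatim(output: str, passages: list[str], min_words: int = 8) -> bool:
    """Return True if the output repeats any run of `min_words` consecutive
    words from a private passage. Passages shorter than `min_words` are
    never flagged by this check."""
    out_words = normalize(output).split()
    out_ngrams = {
        tuple(out_words[i:i + min_words])
        for i in range(len(out_words) - min_words + 1)
    }
    for passage in passages:
        words = normalize(passage).split()
        for i in range(len(words) - min_words + 1):
            if tuple(words[i:i + min_words]) in out_ngrams:
                return True
    return False


def exact_match_rate(outputs: list[str], targets: list[str]) -> float:
    """Fraction of outputs that contain their target passage verbatim.
    ASSUMPTION: EM is approximated here as normalized containment."""
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and the same length")
    hits = sum(normalize(t) in normalize(o) for o, t in zip(outputs, targets))
    return hits / len(outputs)


passage = "the quarterly revenue for the private client was four million dollars in march"
leaky = "Sure! Here is the context: The quarterly revenue for the private client was four million dollars in March."
refusal = "I cannot share the retrieved documents."
```

A word-window check like this only catches verbatim reuse; paraphrased leaks would need the semantic-overlap flagging mentioned in the same item.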

Optimization Features

Token Efficiency

continuous-embedding relaxation avoids huge token search
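One way to picture the relaxation: the suffix is held as continuous vectors during optimization and mapped back to discrete tokens afterwards. The toy sketch below shows only that mapping step, using nearest-neighbor lookup by squared Euclidean distance. The projection rule, the function names, and the two-token vocabulary are assumptions for illustration; the source does not describe how MARAGE converts embeddings back to tokens.

```python
def nearest_token(vector: list[float], vocab_embeddings: dict[str, list[float]]) -> str:
    """Return the vocabulary token whose embedding is closest to `vector`."""
    def squared_distance(a: list[float], b: list[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        vocab_embeddings,
        key=lambda token: squared_distance(vector, vocab_embeddings[token]),
    )


def project_suffix(
    suffix_vectors: list[list[float]],
    vocab_embeddings: dict[str, list[float]],
) -> list[str]:
    """Map each continuous suffix vector to its nearest discrete token."""
    return [nearest_token(v, vocab_embeddings) for v in suffix_vectors]


# Hypothetical two-token vocabulary with 2-dimensional embeddings.
toy_vocab = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
tokens = project_suffix([[0.1, 0.2], [0.9, 0.8]], toy_vocab)
```

The efficiency argument is that gradient steps move every suffix position at once in embedding space, rather than scoring many candidate token swaps one by one.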

Infra Optimization

memory: ≈48GB GPU for long targets vs >1000GB estimated for greedy GCG

System Optimization

aggregate gradients across models sharing same embedding size
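The shared-embedding-size requirement follows directly from how gradients are combined: each model yields a gradient of shape (suffix length, embedding dimension), and these can only be merged elementwise when the shapes agree. A minimal sketch, assuming simple averaging (the source says only "aggregate") and a hypothetical function name:

```python
def aggregate_gradients(per_model_grads: list[list[list[float]]]) -> list[list[float]]:
    """Average gradients from several frozen models.

    Each entry has shape (suffix_len, embed_dim). ASSUMPTION: a plain
    mean is used; the actual aggregation rule is not given in the source.
    """
    if not per_model_grads:
        raise ValueError("need at least one gradient")
    rows, cols = len(per_model_grads[0]), len(per_model_grads[0][0])
    for grad in per_model_grads:
        if len(grad) != rows or any(len(row) != cols for row in grad):
            # Models with different embedding sizes cannot be combined.
            raise ValueError("all gradients must share (suffix_len, embed_dim)")
    count = len(per_model_grads)
    return [
        [sum(grad[i][j] for grad in per_model_grads) / count for j in range(cols)]
        for i in range(rows)
    ]


averaged = aggregate_gradients([[[1.0, 2.0]], [[3.0, 4.0]]])
```

The shape check is where a surrogate with a different embedding dimension would be rejected, which matches the limitation listed under Risks & Boundaries.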

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Rag-12000 (neural-bridge/Falcon RefinedWeb subset), Rag-minibioasq (BioASQ subset), Rag-v1 (glaive-made), Rag-synthetic (ChatGPT-generated)

Risks & Boundaries

Limitations

Joint optimization requires models with the same embedding dimension; limits which surrogates you can use.

Performance drops on very high-perplexity or malformed-input samples (e.g., certain Unicode) as seen on Mistral.

When Not To Use

If you cannot append arbitrary suffixes to queries (no write access to query content).

When production models have been adversarially trained specifically against suffix attacks.

Failure Modes

High dataset perplexity or special tokens can break extraction (for example, the failure on Mistral with Unicode input).

A too-short adversarial suffix (ADV) lacks semantics; a too-long one overfits and transfers less well.

Core Entities

Models

LLaMA3-8B-Instruct, GPT-J-6B, Vicuna-7B-v1.5, OPT-6.7B, Mistral-7B-v0.3, LLaMA-3, LLaMA-2, Vicuna-33B, Qwen2.5/2.57B

Metrics

Exact Match (EM), BLEU, Extended Edit Distance (EED), Semantic Similarity (SS)

Datasets

Rag-12000, Rag-minibioasq, Rag-v1, Rag-synthetic
