MARAGE: optimize a short adversarial suffix that makes RAG systems regurgitate retrieved private data across unseen models

February 5, 2025 | 8 min

Overview

Decision Snapshot: Needs Validation

Results are strong across multiple open models and datasets and backed by layerwise probes, but joint optimization requires matching embedding sizes and the attack struggles on very high-perplexity or tokenization-robust models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 45%

Novelty: 60%

Authors

Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie

Why It Matters For Business

Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.

Who Should Care

Summary TLDR

This paper presents MARAGE, an optimization method that builds a universal adversarial suffix which, when appended to a user query, causes RAG systems to output their retrieved context verbatim. Key ideas: relax the discrete token search by optimizing continuous embeddings, aggregate gradients from multiple models to improve transfer, and use "primacy weighting" to focus the loss on initial tokens so that long targets are recovered. Evaluations on four RAG datasets and many open models show that MARAGE far outperforms manual templates and prior optimizers, transfers to unseen models, and resists simple system-prompt defenses. The attack is efficient (≈48 GB of GPU memory for long targets) but needs access to example RAG pairs.
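The "primacy weighting" idea can be illustrated with a position-weighted sequence loss. This is a minimal sketch, not the paper's formula: the geometric decay schedule, the `decay` parameter, and the function names are assumptions chosen only to show how weighting early target tokens more heavily changes the loss.

```python
import math

def primacy_weights(target_len, decay=0.9):
    """Normalized weights that shrink geometrically with token position.

    The geometric schedule and `decay` value are illustrative assumptions.
    """
    raw = [decay ** i for i in range(target_len)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_nll(token_probs, weights):
    """Weighted negative log-likelihood of the target tokens.

    token_probs[i] is the model's probability of the i-th target token.
    """
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))

# Two hypothetical generations with the same per-token probabilities, reordered.
early_miss = [0.1, 0.9, 0.9, 0.9]  # unlikely first token
late_miss = [0.9, 0.9, 0.9, 0.1]   # unlikely last token

uniform = [0.25] * 4
primacy = primacy_weights(4, decay=0.5)

loss_uniform_early = weighted_nll(early_miss, uniform)
loss_uniform_late = weighted_nll(late_miss, uniform)
loss_primacy_early = weighted_nll(early_miss, primacy)
loss_primacy_late = weighted_nll(late_miss, primacy)
```

Under uniform weights the two cases score the same. Under the assumed primacy weights, the early miss is penalized more, which pushes an optimizer to get the start of the target right first.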

Problem Statement

RAG systems insert retrieved private data into prompts, so an attacker who can only submit queries may still coerce the generation model into leaking that data verbatim. Existing attacks rely on manual templates or greedy optimization; they fail to scale to long retrieved passages and do not transfer across models.

Main Contribution

MARAGE: a continuous-relaxation optimization that finds adversarial suffixes causing RAG data to be output verbatim.

Multi-model joint optimization: aggregate gradients from multiple frozen models to make adversarial suffixes transfer to unseen model architectures.
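The two contributions above can be sketched together on a toy problem. Everything here is an assumption for illustration: the 2-dimensional vocabulary, the quadratic losses standing in for frozen LLMs, averaging as the aggregation rule, and nearest-neighbour projection as the way back to discrete tokens. It is not the authors' implementation.

```python
# Toy vocabulary: token -> 2-d embedding (hypothetical values).
VOCAB = {
    "alpha": (0.0, 0.0),
    "beta": (1.0, 0.0),
    "gamma": (0.0, 1.0),
    "delta": (1.0, 1.0),
}

def toy_grad(suffix, target):
    """Gradient of sum_i ||s_i - t_i||^2 with respect to each suffix vector."""
    return [tuple(2.0 * (s - t) for s, t in zip(vec, tgt))
            for vec, tgt in zip(suffix, target)]

def aggregate(grads_per_model):
    """Average per-position gradients across models (assumed aggregation rule)."""
    n = len(grads_per_model)
    return [tuple(sum(vals) / n for vals in zip(*position))
            for position in zip(*grads_per_model)]

def nearest_token(vec):
    """Project a continuous vector onto the closest vocabulary embedding."""
    return min(VOCAB,
               key=lambda tok: sum((a - b) ** 2 for a, b in zip(vec, VOCAB[tok])))

def optimize_suffix(model_targets, steps=200, lr=0.1):
    """Gradient descent on one continuous suffix shared by all surrogates."""
    length = len(model_targets[0])
    suffix = [(0.5, 0.5)] * length
    for _ in range(steps):
        grads = aggregate([toy_grad(suffix, t) for t in model_targets])
        suffix = [tuple(s - lr * g for s, g in zip(vec, grad))
                  for vec, grad in zip(suffix, grads)]
    return suffix

# Two hypothetical surrogates that prefer slightly different suffix embeddings.
model_a = [(0.9, 0.1), (0.1, 0.8)]
model_b = [(1.1, -0.1), (-0.1, 1.2)]

continuous = optimize_suffix([model_a, model_b])
tokens = [nearest_token(v) for v in continuous]
```

Because both toy models share the same embedding dimension, their gradients can be averaged position by position. The optimized suffix settles between the two models' preferences and is then snapped to real tokens.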

Key Findings

MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.

Numbers: EM up to 0.796 vs manual 0.082 on LLaMA3 (Rag-12000); 12/20 entries EM>0.8

Practical Use: If you run a public RAG service, attackers can programmatically recover long retrieved passages far more reliably than prior methods; simple template fixes are not enough.

Evidence Ref: Table 2

Joint optimization across models gives strong transfer to unseen models.

Numbers: The jointly optimized adversarial suffix (ADV) achieved EM=1.0 on some targets when transferred (Table 3)

Practical Use: An attacker can optimize against a small set of surrogate models and still succeed against different production models.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match (EM) | EM up to 0.98-1.0 on easier datasets; EM 0.796 on LLaMA3 Rag-12000 | manual templates / Pleak / GCG | substantial (e.g., 0.796 vs 0.082 manual on LLaMA3 Rag-12000) | Table 2 (Rag-12000, Rag-minibioasq, Rag-v1, Rag-synthetic) | MARAGE outperforms baselines across models and datasets | Table 2
Transfer EM (black-box joint optimization) | Joint ADV produced EM=1.0 on some target/model combos and high EMs across unseen models | single-model optimized ADV | improved transferability versus single-model optimization | Rag-minibioasq (Table 3) | Joint optimization across LLaMA3-Instruct, GPT-J, OPT gave EM=1.0 on some transfers | Table 3
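To make the EM column concrete, here is a minimal sketch of how an exact-match rate and its delta against a baseline could be computed. The matching rule is an assumption (a response counts as a match if the whitespace-normalized retrieved context appears verbatim inside it); the paper may define EM differently. The sample pairs are hypothetical.

```python
def normalize(text):
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(text.split())

def exact_match(response, context):
    """1.0 if the retrieved context appears verbatim in the response, else 0.0."""
    return 1.0 if normalize(context) in normalize(response) else 0.0

def em_rate(pairs):
    """Mean exact-match score over (response, context) pairs."""
    if not pairs:
        return 0.0
    return sum(exact_match(r, c) for r, c in pairs) / len(pairs)

# Hypothetical (model response, retrieved context) pairs.
pairs = [
    ("Sure. The capital of France is Paris.", "The capital  of France is Paris."),
    ("I cannot share that.", "Secret memo text."),
]
sample_em = em_rate(pairs)

# Delta for the reported LLaMA3 / Rag-12000 row: MARAGE 0.796 vs manual 0.082.
reported_delta = 0.796 - 0.082
```

On the reported numbers the delta is about 0.714 EM points, which is the gap the "substantial" entry in the table refers to.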

What To Try In 7 Days

Run a red-team: simulate MARAGE using public surrogate models and a small RAG sample set to measure EM leakage.

Add output filters: block outputs that contain exact substrings of your private store or flag high-semantic-overlap outputs.

Evaluate adversarial training: fine-tune models, or add input filters, to detect adversarial suffixes using probes such as V-usable information.
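The output-filter item above can be prototyped with a word n-gram index over the private store. This is a sketch under stated assumptions: the function names, the lowercasing, and the n-gram length `n` are hypothetical choices, and it only catches verbatim overlap. Detecting high semantic overlap would need an embedding model and is not shown.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(private_docs, n=8):
    """Collect every word n-gram that occurs in the private documents."""
    index = set()
    for doc in private_docs:
        index |= word_ngrams(doc, n)
    return index

def leaks_private_text(output, index, n=8):
    """True if the output shares at least one word n-gram with the private store."""
    return any(gram in index for gram in word_ngrams(output, n))

# Hypothetical private store and model outputs.
docs = ["Patient 4471 was prescribed 20 mg of drug X on the third of March."]
index = build_index(docs, n=5)

leaky = "Here is the context: patient 4471 was prescribed 20 mg of drug X"
safe = "The patient received medication in March."

leaky_flag = leaks_private_text(leaky, index, n=5)
safe_flag = leaks_private_text(safe, index, n=5)
```

A smaller `n` catches shorter leaks but raises false positives on common phrases, so the threshold is worth tuning against real traffic.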

Optimization Features

Token Efficiency: continuous-embedding relaxation avoids a huge discrete token search
Infra Optimization: memory of ≈48 GB GPU for long targets vs >1000 GB estimated for greedy GCG
System Optimization: aggregate gradients across models that share the same embedding size

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Rag-12000 (neural-bridge/Falcon RefinedWeb subset)
Rag-minibioasq (BioASQ subset)
Rag-v1 (glaive-made)
Rag-synthetic (chatgpt-generated)

Risks & Boundaries

Limitations

Joint optimization requires models with the same embedding dimension; limits which surrogates you can use.

Performance drops on very high-perplexity or malformed-input samples (e.g., certain Unicode) as seen on Mistral.
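The embedding-dimension constraint can be checked before choosing surrogates. The sketch below is illustrative: the helper names are hypothetical, and the hidden sizes in `EMBED_DIM` are values believed to match the public model configs but should be verified against each model's published configuration before use.

```python
from collections import defaultdict

# Assumed hidden sizes; verify against each model's published config.
EMBED_DIM = {
    "LLaMA3-8B-Instruct": 4096,
    "GPT-J-6B": 4096,
    "OPT-6.7B": 4096,
    "Vicuna-33B": 6656,
}

def group_by_embedding_dim(dims):
    """Map each embedding dimension to the models that use it."""
    groups = defaultdict(list)
    for name, dim in dims.items():
        groups[dim].append(name)
    return dict(groups)

def can_joint_optimize(models, dims):
    """True if all listed models share one embedding dimension."""
    return len({dims[m] for m in models}) == 1

groups = group_by_embedding_dim(EMBED_DIM)
ok = can_joint_optimize(["LLaMA3-8B-Instruct", "GPT-J-6B", "OPT-6.7B"], EMBED_DIM)
mixed = can_joint_optimize(["GPT-J-6B", "Vicuna-33B"], EMBED_DIM)
```

Under these assumed sizes, the three surrogates named in the Table 3 row fall in one group, while a larger model with a different hidden size could not share the same continuous suffix.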

When Not To Use

If you cannot append arbitrary suffixes to queries (no write access to query content).

When production models have been adversarially trained specifically against suffix attacks.

Failure Modes

High dataset perplexity or special tokens can break extraction (for example, the failure on Mistral with Unicode input).

An adversarial suffix (ADV) that is too short lacks semantics; one that is too long overfits and transfers less well.

Core Entities

Models

LLaMA3-8B-Instruct, GPT-J-6B, Vicuna-7B-v1.5, OPT-6.7B, Mistral-7B-v0.3, LLaMA-3, LLaMA-2, Vicuna-33B, Qwen2.5/2.57B

Metrics

Exact Match (EM), BLEU, Extended Edit Distance (EED), Semantic Similarity (SS)

Datasets

Rag-12000, Rag-minibioasq, Rag-v1, Rag-synthetic