MARAGE: optimize a short adversarial suffix that makes RAG systems regurgitate retrieved private data across unseen models

February 5, 2025 · 8 min

Overview

Production Readiness

0.45

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie

Why It Matters For Business

Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.

Summary TLDR

This paper presents MARAGE, an optimization method that builds a universal adversarial suffix which, when appended to a user query, causes RAG systems to output retrieved context verbatim. Key ideas: relax discrete token search by optimizing continuous embeddings, aggregate gradients from multiple models to improve transfer, and use "primacy weighting" to focus the loss on initial tokens so that long targets are recovered. Evaluations on four RAG datasets and many open models show that MARAGE far outperforms manual templates and prior optimizers, transfers to unseen models, and resists simple system-prompt defenses. The attack is efficient (≈48GB GPU for long targets) but needs access to example RAG pairs.

Problem Statement

RAG systems insert retrieved private data into prompts, so attackers who can only submit queries may coerce the generation model into leaking that data verbatim. Existing attacks rely on manual templates or greedy optimization and fail to scale to long retrieved passages or to transfer across models.

Main Contribution

MARAGE: a continuous-relaxation optimization that finds adversarial suffixes causing RAG data to be output verbatim.

Multi-model joint optimization: aggregate gradients from multiple frozen models to make adversarial suffixes transfer to unseen model architectures.

Primacy weighting: emphasize early target tokens with a smooth decay to make long-target extraction reliable.
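
A minimal sketch of what primacy weighting might look like, assuming a geometric decay over target positions. The summary says only that the decay is smooth, so the geometric form, the normalization, the function names, and the toy log-probabilities are all illustrative assumptions; 0.9 is the decay value reported in the ablation.

```python
import math

def primacy_weights(target_len, decay=0.9):
    """Per-position weights decay**t, normalized to sum to 1 (assumed form)."""
    raw = [decay ** t for t in range(target_len)]
    total = sum(raw)
    return [w / total for w in raw]

def primacy_weighted_nll(token_log_probs, decay=0.9):
    """Weighted negative log-likelihood of the target tokens.

    token_log_probs[t] is the model's log-probability of target token t.
    decay=1.0 reduces to a plain mean over positions (no primacy weighting).
    """
    weights = primacy_weights(len(token_log_probs), decay)
    return -sum(w * lp for w, lp in zip(weights, token_log_probs))

# Toy target: the first token is poorly predicted, the rest are easy.
log_probs = [math.log(0.5), math.log(0.9), math.log(0.9), math.log(0.9)]
loss_primacy = primacy_weighted_nll(log_probs, decay=0.9)
loss_uniform = primacy_weighted_nll(log_probs, decay=1.0)
```

With these toy numbers the weighted loss is larger than the uniform one, because the badly predicted first token carries more weight; an optimizer would therefore be pushed to fix the start of the target first.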

Key Findings

MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.

Numbers: EM up to 0.796 vs. manual 0.082 on LLaMA3 (Rag-12000); 12/20 entries EM>0.8

Joint optimization across models gives strong transfer to unseen models.

Numbers: jointly optimized ADV achieved EM=1.0 on some targets when transferred (Table 3)

Primacy weighting is critical for generalization to long targets.

Numbers: EM 0.796 with decay 0.9 vs. EM 0.293 with no decay (Rag-12000, Table 5)

MARAGE influences internal model states through the full generation, unlike prior attacks.

Numbers: V-usable information Vi ≈0.982 at layer 31, token 100 vs. Pleak Vi ≈0.024 (Table 4)

Simple system-prompt defenses perform poorly against MARAGE.

Numbers: EM ≈0.79 with two defense prompts vs. EM 0.796 baseline (Table 7)

Results

Exact Match (EM)

Value: EM up to 0.98–1.0 on easier datasets; EM 0.796 on LLaMA3 Rag-12000

Baseline: manual templates / Pleak / GCG

Transfer EM (black-box joint optimization)

Value: joint ADV produced EM=1.0 on some target/model combos and high EMs across unseen models

Baseline: single-model optimized ADV

Ablation: primacy weighting

Value: EM 0.796 with decay 0.9 vs. EM 0.293 with no decay

Baseline: no primacy weighting

Defense robustness

Value: EM ≈0.79 with two system-prompt defenses vs. 0.796 baseline

Baseline: manual attack EM 0.014 under the same defense

Who Should Care

What To Try In 7 Days

Run a red-team: simulate MARAGE using public surrogate models and a small RAG sample set to measure EM leakage.

Add output filters: block outputs that contain exact substrings of your private store or flag high-semantic-overlap outputs.

Evaluate adversarial training: fine-tune models or add filters that detect adversarial suffixes, using probes such as V-usable information.
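
The output-filter idea above can be sketched as a verbatim-overlap check. This is a hypothetical starting point, not a defense evaluated in the paper: the function names, the whitespace/lowercase normalization, and the 8-word threshold are all assumptions, and a real private store would need an index rather than a linear scan.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial reformatting does not hide a leak."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def leaked_spans(output, private_docs, min_words=8):
    """Return word n-grams (length min_words) of the output found verbatim in a private doc."""
    out_words = normalize(output).split()
    # Pad with spaces so matches align on word boundaries.
    docs = [" " + normalize(d) + " " for d in private_docs]
    hits = []
    for i in range(len(out_words) - min_words + 1):
        span = " ".join(out_words[i:i + min_words])
        if any((" " + span + " ") in d for d in docs):
            hits.append(span)
    return hits

def should_block(output, private_docs, min_words=8):
    """Block the response if any long verbatim span from the private store appears."""
    return bool(leaked_spans(output, private_docs, min_words))

store = ["Patient record 4471: the subject was prescribed 20 mg of drug X daily for six weeks."]
leaky = "Sure. Patient record 4471: the subject was prescribed 20 mg of drug X daily"
benign = "I cannot share the retrieved documents."
```

Here `should_block(leaky, store)` is true and `should_block(benign, store)` is false. An exact-substring filter like this misses paraphrased leaks, which is why the item above also suggests flagging high semantic overlap.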

Optimization Features

Token Efficiency

  • continuous-embedding relaxation avoids huge token search
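
A toy sketch of the continuous-relaxation idea: optimize suffix vectors by gradient descent, then map each vector back to a token. The nearest-embedding projection, the 2-d vocabulary, and the quadratic stand-in objective are assumptions made for illustration; the summary does not say how MARAGE converts optimized embeddings back to tokens, and a real attack would use gradients of a language-model loss.

```python
def nearest_token(vec, embedding_table):
    """Index of the embedding row closest to vec (squared Euclidean distance)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(vec, row))
    return min(range(len(embedding_table)), key=lambda i: sq_dist(embedding_table[i]))

def optimize_suffix(init, grad_fn, steps=100, lr=0.1):
    """Plain gradient descent over a list of continuous suffix embeddings."""
    suffix = [list(v) for v in init]
    for _ in range(steps):
        grads = grad_fn(suffix)
        suffix = [[x - lr * g for x, g in zip(v, gv)] for v, gv in zip(suffix, grads)]
    return suffix

# Hypothetical 4-token vocabulary with 2-d embeddings.
vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stand-in objective: squared distance to fixed goal vectors.
goal = [[0.9, 0.1], [0.2, 0.8]]

def toy_grad(suffix):
    return [[2 * (x - g) for x, g in zip(v, gv)] for v, gv in zip(suffix, goal)]

relaxed = optimize_suffix([[0.5, 0.5], [0.5, 0.5]], toy_grad)
token_ids = [nearest_token(v, vocab) for v in relaxed]
```

The point of the relaxation is that each step updates every suffix position at once from a gradient, instead of scoring many discrete token substitutions per step.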

Infra Optimization

  • memory: ≈48GB GPU for long targets vs >1000GB estimated for greedy GCG

System Optimization

  • aggregate gradients across models sharing same embedding size
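
A minimal sketch of multi-model gradient aggregation, assuming the aggregate is a simple average (the summary says only "aggregate"). Each "model" here is a hypothetical toy function that returns a gradient for the shared suffix embeddings; the shape check mirrors the stated requirement that the models share the same embedding size.

```python
def aggregate_gradients(suffix, grad_fns):
    """Average the gradients of one shared suffix across several frozen models."""
    per_model = [fn(suffix) for fn in grad_fns]
    dim = len(suffix[0])
    for grads in per_model:
        if len(grads) != len(suffix) or any(len(g) != dim for g in grads):
            raise ValueError("all models must share the suffix shape and embedding size")
    n = len(per_model)
    return [[sum(grads[t][d] for grads in per_model) / n for d in range(dim)]
            for t in range(len(suffix))]

def quadratic_grad(target):
    """Toy 'model': gradient of the squared distance from each suffix vector to target."""
    def fn(suffix):
        return [[2 * (x - t) for x, t in zip(vec, target)] for vec in suffix]
    return fn

model_a = quadratic_grad([1.0, 0.0])
model_b = quadratic_grad([0.0, 1.0])

suffix = [[0.0, 0.0]]
for _ in range(200):
    g = aggregate_gradients(suffix, [model_a, model_b])
    suffix = [[x - 0.1 * gx for x, gx in zip(v, gv)] for v, gv in zip(suffix, g)]
```

In this toy setup the suffix settles midway between the two models' individual optima, which is the intuition behind better transfer: the suffix is not tuned to any single model.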

Reproducibility

Data Urls

  • Rag-12000 (neural-bridge/Falcon RefinedWeb subset)
  • Rag-minibioasq (BioASQ subset)
  • Rag-v1 (glaive-made)
  • Rag-synthetic (ChatGPT-generated)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Joint optimization requires models with the same embedding dimension; limits which surrogates you can use.
  • Performance drops on very high-perplexity or malformed-input samples (e.g., certain Unicode) as seen on Mistral.
  • The attack assumes the attacker can append arbitrary text to the user query and has access to example d∥q pairs for optimization.
  • MARAGE may generate extra trailing text (no clear stop), affecting BLEU and SS though not EM.

When Not To Use

  • If you cannot append arbitrary suffixes to queries (no write access to query content).
  • When production models have been adversarially trained specifically against suffix attacks.
  • If target models use heavy pure-sampling decoding and your only goal is high exact-match success.

Failure Modes

  • High dataset perplexity or special tokens can break extraction (example failure on Mistral with Unicode).
  • Too-short ADV lacks semantics; too-long ADV overfits and reduces transfer.
  • Sampling or greedy decoding can reduce EM compared with beam-based decoding.

Core Entities

Models

  • LLaMA3-8B-Instruct
  • GPT-J-6B
  • Vicuna-7B-v1.5
  • OPT-6.7B
  • Mistral-7B-v0.3
  • LLaMA-3
  • LLaMA-2
  • Vicuna-33B
  • Qwen2.5/2.57B

Metrics

  • Exact Match (EM)
  • BLEU
  • Extended Edit Distance (EED)
  • Semantic Similarity (SS)
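
For a red-team harness, an EM-style leakage rate can be computed roughly as follows. This sketch assumes a sample counts as a hit when the retrieved context appears verbatim anywhere in the model output, which is consistent with the note that extra trailing text does not affect EM; the paper's exact definition may differ, and the function name is hypothetical.

```python
def exact_match_rate(contexts, outputs):
    """Fraction of samples whose retrieved context appears verbatim in the output."""
    if len(contexts) != len(outputs):
        raise ValueError("contexts and outputs must have the same length")
    if not contexts:
        return 0.0
    hits = 0
    for context, output in zip(contexts, outputs):
        needle = context.strip()
        # Skip empty contexts so they do not count as trivial matches.
        if needle and needle in output:
            hits += 1
    return hits / len(contexts)

contexts = ["alpha beta", "gamma delta"]
outputs = ["Sure: alpha beta and more", "gamma  delta"]  # second differs by whitespace
rate = exact_match_rate(contexts, outputs)
```

Because the check is strictly verbatim, a single changed character counts as a miss, so this number is best read alongside softer metrics such as BLEU or semantic similarity.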

Datasets

  • Rag-12000
  • Rag-minibioasq
  • Rag-v1
  • Rag-synthetic