A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

October 19, 20238 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

10

Authors

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong

Links

Abstract / PDF

Why It Matters For Business

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Summary TLDR

The paper gives a clear formal definition and an implementation framework for prompt injection (attacker instructions embedded in data). It builds a reproducible benchmark: 5 attack styles (including a new combined attack), 10 defenses (prevention and detection), 10 LLMs, and 7 NLP tasks. Results show combined attacks are the strongest and work across models and tasks (ASV ~0.62, MR ~0.78 averaged over LLMs), and no evaluated defense reliably blocks attacks without hurting normal utility. The authors open-source their testbed.

Problem Statement

LLM-backed apps accept external data and instruction prompts. Malicious users can hide instructions in data (prompt injection) to make the app do what the attacker wants. Prior work was ad hoc. We lack a formal definition, a reusable benchmark, and a measured assessment of existing defenses.

Main Contribution

A formal definition and modular framework for prompt injection attacks that unifies existing attack patterns.

Design of a new Combined Attack that mixes escape characters, context-ignoring text, and fake completions.

A public benchmark and platform that measures 5 attacks, 10 defenses, 10 LLMs, and 7 tasks (code released).

A systematic evaluation showing attacks are effective across models/tasks and that current defenses have major gaps.

Key Findings

Prompt injection attacks are broadly effective across tasks and models.

NumbersCombined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.

NumbersGPT-4 averaged ASV: Combined=0.75 vs Fake Completion=0.70 and Naive=0.62

No evaluated prevention defense reliably blocks attacks without harming normal utility.

NumbersParaphrasing reduces target-task utility on average by 0.14 (PNA-T drop)

Detection defenses trade off misses and false alarms; only known-answer detection shows practical promise but still misses cases.

NumbersKnown-answer detection: FNR ≈0.04 (avg) and FPR≈0.01 (avg) across tasks in reported runs

Perplexity-based detectors often fail to flag compromised data.

NumbersPPL detection FNR often ≥0.77 and up to 1.0 on several tasks

Results

ASV (Combined Attack, GPT-4 averaged over 7×7 pairs)

Value0.75

BaselineOther attacks (Fake Completion 0.70; Naive 0.62)

ASV and MR (Combined Attack averaged over 10 LLMs and 7×7 tasks)

ValueASV=0.62; MR=0.78

BaselineNo defense

PNA-T utility loss (Paraphrasing defense)

Valueaverage -0.14

BaselinePNA-T without defense

Detection FNR (Perplexity based)

Valueoften ≥0.77 and up to 1.0 on several tasks

BaselineExpected low FNR detector

Known-answer detection (avg)

ValueFNR ≈0.04; FPR ≈0.01

BaselineOther detectors

Who Should Care

What To Try In 7 Days

Run the paper's Combined Attack from the repo against your LLM pipeline to measure ASV and MR on your tasks.

Add a known-answer check as a gating detection and log FPR/FNR; do not rely on perplexity alone.

If you pre-process inputs (paraphrase/retokenize/delimit), measure accuracy on clean inputs to quantify utility loss.

Reproducibility

Data Urls

  • MRPC (GLUE)
  • Jfleg
  • HSOL
  • RTE
  • SST2
  • SMS Spam
  • Gigaword

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Attacker model assumes the attacker cannot change the instruction prompt and does not know internal prompt details.
  • All evaluated attacks are heuristic; optimization-based or fully adaptive attacks are left for future work.
  • Some concurrent defenses (task-specific fine-tuning) were not evaluated in this benchmark.

When Not To Use

  • Not intended as a benchmark for jailbreaking/unsafe content elicitation (different threat model).
  • Not a full evaluation of adaptive attackers who can query the application repeatedly.
  • Not directly applicable to multimodal inputs (paper uses text-only datasets).

Failure Modes

  • Detection methods that look at text quality (perplexity) miss realistic injected instructions.
  • Prevention methods (paraphrase/retokenize) can degrade clean-task accuracy, causing unacceptable utility loss.
  • Known-answer detection can be overwritten by some injected samples and misses a nontrivial fraction.

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • PaLM 2 text-bison-001
  • Bard
  • Vicuna-33b-v1.3
  • Vicuna-13b-v1.3
  • Flan-UL2
  • Llama-2-13b-chat
  • Llama-2-7b-chat
  • InternLM-Chat-7B

Metrics

  • Attack Success Value (ASV)
  • Matching Rate (MR)
  • Performance under No Attacks (PNA)
  • Accuracy
  • ROUGE-1
  • GLEU
  • False Positive Rate (FPR)
  • False Negative Rate (FNR)

Datasets

  • MRPC
  • Jfleg
  • HSOL
  • RTE
  • SST2
  • SMS Spam
  • Gigaword

Benchmarks

  • Prompt-injection benchmark (this paper): 7 target tasks × 7 injected tasks, 10 LLMs, 5 attacks, 10 d