Overview
The benchmark is well-scoped and reproducible; it shows strong empirical evidence across many models and tasks, but defenses require further work and adaptive attackers are not exhaustively covered.
Citations10
Evidence Strength0.90
Confidence0.90
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 70%
Why It Matters For Business
Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.
Who Should Care
Summary TLDR
The paper gives a clear formal definition and an implementation framework for prompt injection (attacker instructions embedded in data). It builds a reproducible benchmark: 5 attack styles (including a new combined attack), 10 defenses (prevention and detection), 10 LLMs, and 7 NLP tasks. Results show combined attacks are the strongest and work across models and tasks (ASV ~0.62, MR ~0.78 averaged over LLMs), and no evaluated defense reliably blocks attacks without hurting normal utility. The authors open-source their testbed.
Problem Statement
LLM-backed apps accept external data and instruction prompts. Malicious users can hide instructions in data (prompt injection) to make the app do what the attacker wants. Prior work was ad hoc. We lack a formal definition, a reusable benchmark, and a measured assessment of existing defenses.
Main Contribution
A formal definition and modular framework for prompt injection attacks that unifies existing attack patterns.
Design of a new Combined Attack that mixes escape characters, context-ignoring text, and fake completions.
Key Findings
Prompt injection attacks are broadly effective across tasks and models.
A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASV (Combined Attack, GPT-4 averaged over 7×7 pairs) | 0.75 | Other attacks (Fake Completion 0.70; Naive 0.62) | Combined +0.05 vs Fake Completion | Average over 7 target × 7 injected tasks | Table 4; Fig.2 | Table 4 |
| ASV and MR (Combined Attack averaged over 10 LLMs and 7×7 tasks) | ASV=0.62; MR=0.78 | No defense | — | Average across 10 LLMs and 49 target/injected pairs | Table 5; Sec.6.2 | Table 5 |
What To Try In 7 Days
Run the paper's Combined Attack from the repo against your LLM pipeline to measure ASV and MR on your tasks.
Add a known-answer check as a gating detection and log FPR/FNR; do not rely on perplexity alone.
If you pre-process inputs (paraphrase/retokenize/delimit), measure accuracy on clean inputs to quantify utility loss.
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Attacker model assumes the attacker cannot change the instruction prompt and does not know internal prompt details.
All evaluated attacks are heuristic; optimization-based or fully adaptive attacks are left for future work.
When Not To Use
Not intended as a benchmark for jailbreaking/unsafe content elicitation (different threat model).
Not a full evaluation of adaptive attackers who can query the application repeatedly.
Failure Modes
Detection methods that look at text quality (perplexity) miss realistic injected instructions.
Prevention methods (paraphrase/retokenize) can degrade clean-task accuracy, causing unacceptable utility loss.

