Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
10
Why It Matters For Business
Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.
Summary TLDR
The paper gives a clear formal definition and an implementation framework for prompt injection (attacker instructions embedded in data). It builds a reproducible benchmark: 5 attack styles (including a new combined attack), 10 defenses (prevention and detection), 10 LLMs, and 7 NLP tasks. Results show combined attacks are the strongest and work across models and tasks (ASV ~0.62, MR ~0.78 averaged over LLMs), and no evaluated defense reliably blocks attacks without hurting normal utility. The authors open-source their testbed.
Problem Statement
LLM-backed apps accept external data and instruction prompts. Malicious users can hide instructions in data (prompt injection) to make the app do what the attacker wants. Prior work was ad hoc. We lack a formal definition, a reusable benchmark, and a measured assessment of existing defenses.
Main Contribution
A formal definition and modular framework for prompt injection attacks that unifies existing attack patterns.
Design of a new Combined Attack that mixes escape characters, context-ignoring text, and fake completions.
A public benchmark and platform that measures 5 attacks, 10 defenses, 10 LLMs, and 7 tasks (code released).
A systematic evaluation showing attacks are effective across models/tasks and that current defenses have major gaps.
Key Findings
Prompt injection attacks are broadly effective across tasks and models.
A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.
No evaluated prevention defense reliably blocks attacks without harming normal utility.
Detection defenses trade off misses and false alarms; only known-answer detection shows practical promise but still misses cases.
Perplexity-based detectors often fail to flag compromised data.
Results
ASV (Combined Attack, GPT-4 averaged over 7×7 pairs)
ASV and MR (Combined Attack averaged over 10 LLMs and 7×7 tasks)
PNA-T utility loss (Paraphrasing defense)
Detection FNR (Perplexity based)
Known-answer detection (avg)
Who Should Care
What To Try In 7 Days
Run the paper's Combined Attack from the repo against your LLM pipeline to measure ASV and MR on your tasks.
Add a known-answer check as a gating detection and log FPR/FNR; do not rely on perplexity alone.
If you pre-process inputs (paraphrase/retokenize/delimit), measure accuracy on clean inputs to quantify utility loss.
Reproducibility
Data Urls
- MRPC (GLUE)
- Jfleg
- HSOL
- RTE
- SST2
- SMS Spam
- Gigaword
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Attacker model assumes the attacker cannot change the instruction prompt and does not know internal prompt details.
- All evaluated attacks are heuristic; optimization-based or fully adaptive attacks are left for future work.
- Some concurrent defenses (task-specific fine-tuning) were not evaluated in this benchmark.
When Not To Use
- Not intended as a benchmark for jailbreaking/unsafe content elicitation (different threat model).
- Not a full evaluation of adaptive attackers who can query the application repeatedly.
- Not directly applicable to multimodal inputs (paper uses text-only datasets).
Failure Modes
- Detection methods that look at text quality (perplexity) miss realistic injected instructions.
- Prevention methods (paraphrase/retokenize) can degrade clean-task accuracy, causing unacceptable utility loss.
- Known-answer detection can be overwritten by some injected samples and misses a nontrivial fraction.
Core Entities
Models
- GPT-4
- GPT-3.5-Turbo
- PaLM 2 text-bison-001
- Bard
- Vicuna-33b-v1.3
- Vicuna-13b-v1.3
- Flan-UL2
- Llama-2-13b-chat
- Llama-2-7b-chat
- InternLM-Chat-7B
Metrics
- Attack Success Value (ASV)
- Matching Rate (MR)
- Performance under No Attacks (PNA)
- Accuracy
- ROUGE-1
- GLEU
- False Positive Rate (FPR)
- False Negative Rate (FNR)
Datasets
- MRPC
- Jfleg
- HSOL
- RTE
- SST2
- SMS Spam
- Gigaword
Benchmarks
- Prompt-injection benchmark (this paper): 7 target tasks × 7 injected tasks, 10 LLMs, 5 attacks, 10 d

