A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

Overview

Decision SnapshotReady For Pilot

The benchmark is well-scoped and reproducible; it shows strong empirical evidence across many models and tasks, but defenses require further work and adaptive attackers are not exhaustively covered.

Citations10

Evidence Strength0.90

Confidence0.90

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Who Should Care

CTO Product Manager ML Engineer Data Scientist Engineering Lead

Summary TLDR

The paper gives a clear formal definition and an implementation framework for prompt injection (attacker instructions embedded in data). It builds a reproducible benchmark: 5 attack styles (including a new combined attack), 10 defenses (prevention and detection), 10 LLMs, and 7 NLP tasks. Results show combined attacks are the strongest and work across models and tasks (ASV ~0.62, MR ~0.78 averaged over LLMs), and no evaluated defense reliably blocks attacks without hurting normal utility. The authors open-source their testbed.

Problem Statement

LLM-backed apps accept external data and instruction prompts. Malicious users can hide instructions in data (prompt injection) to make the app do what the attacker wants. Prior work was ad hoc. We lack a formal definition, a reusable benchmark, and a measured assessment of existing defenses.

Main Contribution

A formal definition and modular framework for prompt injection attacks that unifies existing attack patterns.

Design of a new Combined Attack that mixes escape characters, context-ignoring text, and fake completions.

Key Findings

Prompt injection attacks are broadly effective across tasks and models.

NumbersCombined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

Practical UseTreat data from external sources as adversarial by default and test LLM pipelines under injected-data scenarios.

Evidence RefSec.6.2; Table 5 (results summary)

A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.

NumbersGPT-4 averaged ASV: Combined=0.75 vs Fake Completion=0.70 and Naive=0.62

Practical UseDefenses must be evaluated against composite attacks, not just simple concatenation or single tricks.

Evidence RefTable 4; Fig.2 (GPT-4 ASV comparison)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASV (Combined Attack, GPT-4 averaged over 7×7 pairs)	0.75	Other attacks (Fake Completion 0.70; Naive 0.62)	Combined +0.05 vs Fake Completion	Average over 7 target × 7 injected tasks	Table 4; Fig.2	Table 4
ASV and MR (Combined Attack averaged over 10 LLMs and 7×7 tasks)	ASV=0.62; MR=0.78	No defense	—	Average across 10 LLMs and 49 target/injected pairs	Table 5; Sec.6.2	Table 5

What To Try In 7 Days

Run the paper's Combined Attack from the repo against your LLM pipeline to measure ASV and MR on your tasks.

Add a known-answer check as a gating detection and log FPR/FNR; do not rely on perplexity alone.

If you pre-process inputs (paraphrase/retokenize/delimit), measure accuracy on clean inputs to quantify utility loss.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/liu00222/Open-Prompt-Injection

Data URLs

MRPC (GLUE)JflegHSOLRTESST2SMS SpamGigaword

Risks & Boundaries

Limitations

Attacker model assumes the attacker cannot change the instruction prompt and does not know internal prompt details.

All evaluated attacks are heuristic; optimization-based or fully adaptive attacks are left for future work.

When Not To Use

Not intended as a benchmark for jailbreaking/unsafe content elicitation (different threat model).

Not a full evaluation of adaptive attackers who can query the application repeatedly.

Failure Modes

Detection methods that look at text quality (perplexity) miss realistic injected instructions.

Prevention methods (paraphrase/retokenize) can degrade clean-task accuracy, causing unacceptable utility loss.

Core Entities

Models

GPT-4GPT-3.5-TurboPaLM 2 text-bison-001BardVicuna-33b-v1.3Vicuna-13b-v1.3Flan-UL2Llama-2-13b-chatLlama-2-7b-chatInternLM-Chat-7B

Metrics

Attack Success Value (ASV)Matching Rate (MR)Performance under No Attacks (PNA)AccuracyROUGE-1GLEUFalse Positive Rate (FPR)False Negative Rate (FNR)

Datasets

MRPCJflegHSOLRTESST2SMS SpamGigaword

Benchmarks

Prompt-injection benchmark (this paper): 7 target tasks × 7 injected tasks, 10 LLMs, 5 attacks, 10 d

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Prompt injection attacks are broadly effective across tasks and models.

A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

Key finding

BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

Key finding

JudgeDeceiver: automatically craft prompts that reliably trick LLM-as-a-Judge to pick an attacker’s response

Key finding

Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

Key finding

A systematic, practitioner-focused map of 193 multi-agent security threats and how 16 frameworks cover them

Key finding