A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

October 19, 20238 min

Overview

Decision SnapshotReady For Pilot

The benchmark is well-scoped and reproducible; it shows strong empirical evidence across many models and tasks, but defenses require further work and adaptive attackers are not exhaustively covered.

Citations10

Evidence Strength0.90

Confidence0.90

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Who Should Care

Summary TLDR

The paper gives a clear formal definition and an implementation framework for prompt injection (attacker instructions embedded in data). It builds a reproducible benchmark: 5 attack styles (including a new combined attack), 10 defenses (prevention and detection), 10 LLMs, and 7 NLP tasks. Results show combined attacks are the strongest and work across models and tasks (ASV ~0.62, MR ~0.78 averaged over LLMs), and no evaluated defense reliably blocks attacks without hurting normal utility. The authors open-source their testbed.

Problem Statement

LLM-backed apps accept external data and instruction prompts. Malicious users can hide instructions in data (prompt injection) to make the app do what the attacker wants. Prior work was ad hoc. We lack a formal definition, a reusable benchmark, and a measured assessment of existing defenses.

Main Contribution

A formal definition and modular framework for prompt injection attacks that unifies existing attack patterns.

Design of a new Combined Attack that mixes escape characters, context-ignoring text, and fake completions.

Key Findings

Prompt injection attacks are broadly effective across tasks and models.

NumbersCombined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

Practical UseTreat data from external sources as adversarial by default and test LLM pipelines under injected-data scenarios.

Evidence RefSec.6.2; Table 5 (results summary)

A Combined Attack (mixing strategies) outperforms single-strategy attacks on GPT-4.

NumbersGPT-4 averaged ASV: Combined=0.75 vs Fake Completion=0.70 and Naive=0.62

Practical UseDefenses must be evaluated against composite attacks, not just simple concatenation or single tricks.

Evidence RefTable 4; Fig.2 (GPT-4 ASV comparison)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
ASV (Combined Attack, GPT-4 averaged over 7×7 pairs)0.75Other attacks (Fake Completion 0.70; Naive 0.62)Combined +0.05 vs Fake CompletionAverage over 7 target × 7 injected tasksTable 4; Fig.2Table 4
ASV and MR (Combined Attack averaged over 10 LLMs and 7×7 tasks)ASV=0.62; MR=0.78No defenseAverage across 10 LLMs and 49 target/injected pairsTable 5; Sec.6.2Table 5

What To Try In 7 Days

Run the paper's Combined Attack from the repo against your LLM pipeline to measure ASV and MR on your tasks.

Add a known-answer check as a gating detection and log FPR/FNR; do not rely on perplexity alone.

If you pre-process inputs (paraphrase/retokenize/delimit), measure accuracy on clean inputs to quantify utility loss.

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

MRPC (GLUE)JflegHSOLRTESST2SMS SpamGigaword

Risks & Boundaries

Limitations

Attacker model assumes the attacker cannot change the instruction prompt and does not know internal prompt details.

All evaluated attacks are heuristic; optimization-based or fully adaptive attacks are left for future work.

When Not To Use

Not intended as a benchmark for jailbreaking/unsafe content elicitation (different threat model).

Not a full evaluation of adaptive attackers who can query the application repeatedly.

Failure Modes

Detection methods that look at text quality (perplexity) miss realistic injected instructions.

Prevention methods (paraphrase/retokenize) can degrade clean-task accuracy, causing unacceptable utility loss.

Core Entities

Models

GPT-4GPT-3.5-TurboPaLM 2 text-bison-001BardVicuna-33b-v1.3Vicuna-13b-v1.3Flan-UL2Llama-2-13b-chatLlama-2-7b-chatInternLM-Chat-7B

Metrics

Attack Success Value (ASV)Matching Rate (MR)Performance under No Attacks (PNA)AccuracyROUGE-1GLEUFalse Positive Rate (FPR)False Negative Rate (FNR)

Datasets

MRPCJflegHSOLRTESST2SMS SpamGigaword

Benchmarks

Prompt-injection benchmark (this paper): 7 target tasks × 7 injected tasks, 10 LLMs, 5 attacks, 10 d