BIPIA: a large benchmark and practical defenses for indirect prompt injection attacks on LLMs

Overview

Decision SnapshotReady For Pilot

The paper provides a large benchmark and clear evaluations; black-box fixes are cheap to try while white-box fine-tuning is stronger but requires model control and some compute.

Citations9

Evidence Strength0.80

Confidence0.80

Risk Signals8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

External content can silently hijack LLM outputs. Measure exposure with BIPIA and add simple defenses now; full model fine-tuning yields stronger protection if you control the model.

Who Should Care

Product Manager CTO ML Engineer Engineering Lead Founder

Summary TLDR

The paper introduces BIPIA, the first large benchmark for indirect prompt injection (malicious instructions embedded in external content) and tests 25 LLMs across five real-world tasks. Results show many models are vulnerable (average attack success rate 11.8%), with stronger text models often more likely to follow malicious instructions. The authors propose two defense families: simple black-box prompt techniques (in-context examples, multi-turn separation, explicit reminders) that reduce attacks substantially, and a white-box approach (special tokens + adversarial fine-tuning) that brings attack success near zero on the benchmark while preserving task quality. Code and dataset are released

Problem Statement

When LLMs read third-party content, hidden malicious instructions can hijack their outputs. There was no large, systematic benchmark or well-evaluated defenses for these "indirect prompt injection" attacks. Practitioners need a way to measure risk across tasks and practical fixes that keep normal behavior intact.

Main Contribution

BIPIA: a large benchmark (626,250 train / 86,250 test prompts) that covers five application tasks and 250 attacker goals.

An evaluation of 25 LLMs showing universal but varying vulnerability; more capable models often have higher text attack success rates.

Key Findings

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

NumbersAverage overall ASR = 0.1179 (11.79%) on BIPIA (Table 2)

Practical UseAssume some risk when feeding external content to LLMs; audit pipelines and add defenses before deployment.

Evidence RefTable 2

More capable LLMs tend to follow malicious text instructions more often.

NumbersPearson r=0.6423 (p<0.001) correlation between Elo and ASR on text tasks (Figure 2)

Practical UseHigher-capability chat models may need stronger guarding layers even if they perform better on benign tasks.

Evidence RefFigure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average overall ASR on BIPIA	0.1179	—	—	BIPIA (all tasks)	Average overall ASR across evaluated models	Table 2
GPT-4 overall ASR	0.3103	—	—	BIPIA (all tasks)	GPT-4 attack success rate across tasks	Table 2

What To Try In 7 Days

Run a quick BIPIA-style test on your LLM pipeline to estimate ASR.

Add an explicit reminder in prompts: tell the model not to follow instructions inside external content.

Separate fetched content into an earlier conversation turn and keep user instruction last (multi-turn).

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/microsoft/BIPIA https://arxiv.org/abs/2312.14197 https://arxiv.org/pdf/2312.14197v4

Data URLs

https://github.com/microsoft/BIPIA (dataset and generation scripts referenced)

Risks & Boundaries

Limitations

BIPIA covers many attacks but cannot represent all real-world malicious content patterns.

Black-box defenses reduce but do not eliminate ASR; attackers may adapt to prompt defenses.

When Not To Use

When you do not process third-party content at all (no external inputs).

If the application cannot tolerate any model fine-tuning or vocabulary changes, white-box methods are infeasible.

Failure Modes

Attackers adapt payloads to bypass prompt reminders or in-context examples.

Position biases and long contexts may still let instructions slip through.

Core Entities

Models

GPT-4GPT-3.5-turboVicuna-13BVicuna-7BVicuna-33BLlama2-Chat-70BLlama2-Chat-13BLlama2-Chat-7BWizardLM-70BWizardLM-13BMPT-30B-chatMPT-7B-ChatCodeLlama-34BMistral-7BGuanaco-33BChatGLM2-6BRWKV-4-Raven-14BAlpaca-13BOpenAssistant-Pythia-12BGPT4All-13B-SnoozyStableLM-Tuned-Alpaca-7bDolly-V2-12BChatGLM-6BFastChat-T5-3B

Metrics

Attack Success Rate (ASR)ROUGE-1 (recall)MT-Bench capability scoreElo rating (Chatbot Arena)

Datasets

BIPIAOpenAI Evals (email QA subset)NewsQAWikiTableQuestionsXSumStack Overflow (collected code QA)

Benchmarks

BIPIAMT-Bench

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

More capable LLMs tend to follow malicious text instructions more often.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

Key finding

BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

Key finding

JudgeDeceiver: automatically craft prompts that reliably trick LLM-as-a-Judge to pick an attacker’s response

Key finding

Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

Key finding

A systematic, practitioner-focused map of 193 multi-agent security threats and how 16 frameworks cover them

Key finding