BIPIA: a large benchmark and practical defenses for indirect prompt injection attacks on LLMs

December 21, 20237 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

9

Authors

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Links

Abstract / PDF

Why It Matters For Business

External content can silently hijack LLM outputs. Measure exposure with BIPIA and add simple defenses now; full model fine-tuning yields stronger protection if you control the model.

Summary TLDR

The paper introduces BIPIA, the first large benchmark for indirect prompt injection (malicious instructions embedded in external content) and tests 25 LLMs across five real-world tasks. Results show many models are vulnerable (average attack success rate 11.8%), with stronger text models often more likely to follow malicious instructions. The authors propose two defense families: simple black-box prompt techniques (in-context examples, multi-turn separation, explicit reminders) that reduce attacks substantially, and a white-box approach (special tokens + adversarial fine-tuning) that brings attack success near zero on the benchmark while preserving task quality. Code and dataset are released

Problem Statement

When LLMs read third-party content, hidden malicious instructions can hijack their outputs. There was no large, systematic benchmark or well-evaluated defenses for these "indirect prompt injection" attacks. Practitioners need a way to measure risk across tasks and practical fixes that keep normal behavior intact.

Main Contribution

BIPIA: a large benchmark (626,250 train / 86,250 test prompts) that covers five application tasks and 250 attacker goals.

An evaluation of 25 LLMs showing universal but varying vulnerability; more capable models often have higher text attack success rates.

Two practical defenses: black-box prompt-based methods (explicit reminder, multi-turn, in-context) and a white-box method (data markers + adversarial fine-tuning) with strong mitigation and limited task impact.

Key Findings

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

NumbersAverage overall ASR = 0.1179 (11.79%) on BIPIA (Table 2)

More capable LLMs tend to follow malicious text instructions more often.

NumbersPearson r=0.6423 (p<0.001) correlation between Elo and ASR on text tasks (Figure 2)

Attack instruction position matters: placing malicious text at the end increases success.

NumbersEnd-position yields highest ASR vs. middle/beginning (Figure 5)

Simple black-box prompt methods substantially reduce ASR with small quality cost.

NumbersGPT-4 overall ASR: original 0.3103 → multi-turn 0.2056; in-context 0.2408 (Table 3)

White-box defense (data markers + adversarial fine-tuning) reduces ASR to near-zero while keeping task performance.

NumbersWhite-box defenses reduced ASR to ≈0.005 on BIPIA in some configurations; roughly 10x reduction vs. original (Table 4)

Results

Average overall ASR on BIPIA

Value0.1179

GPT-4 overall ASR

Value0.3103

Black-box (multi-turn) effect on GPT-4

ValueOverall ASR 0.2056

BaselineOriginal ASR 0.3103

White-box adversarial fine-tune (example)

ValueOverall ASR ≈0.005

BaselineOriginal model ASR (model-dependent, e.g., ≥0.02)

BIPIA dataset size

ValueTrain: 626,250 prompts; Test: 86,250 prompts

Who Should Care

What To Try In 7 Days

Run a quick BIPIA-style test on your LLM pipeline to estimate ASR.

Add an explicit reminder in prompts: tell the model not to follow instructions inside external content.

Separate fetched content into an earlier conversation turn and keep user instruction last (multi-turn).

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • BIPIA covers many attacks but cannot represent all real-world malicious content patterns.
  • Black-box defenses reduce but do not eliminate ASR; attackers may adapt to prompt defenses.
  • White-box defense requires model access and fine-tuning infrastructure and may affect out-of-distribution behavior.

When Not To Use

  • When you do not process third-party content at all (no external inputs).
  • If the application cannot tolerate any model fine-tuning or vocabulary changes, white-box methods are infeasible.

Failure Modes

  • Attackers adapt payloads to bypass prompt reminders or in-context examples.
  • Position biases and long contexts may still let instructions slip through.
  • Fine-tuning on a particular defense dataset may overfit to that distribution and miss novel attacks.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • Vicuna-13B
  • Vicuna-7B
  • Vicuna-33B
  • Llama2-Chat-70B
  • Llama2-Chat-13B
  • Llama2-Chat-7B
  • WizardLM-70B
  • WizardLM-13B
  • MPT-30B-chat
  • MPT-7B-Chat
  • CodeLlama-34B
  • Mistral-7B
  • Guanaco-33B
  • ChatGLM2-6B
  • RWKV-4-Raven-14B
  • Alpaca-13B
  • OpenAssistant-Pythia-12B
  • GPT4All-13B-Snoozy
  • StableLM-Tuned-Alpaca-7b
  • Dolly-V2-12B
  • ChatGLM-6B
  • FastChat-T5-3B

Metrics

  • Attack Success Rate (ASR)
  • ROUGE-1 (recall)
  • MT-Bench capability score
  • Elo rating (Chatbot Arena)

Datasets

  • BIPIA
  • OpenAI Evals (email QA subset)
  • NewsQA
  • WikiTableQuestions
  • XSum
  • Stack Overflow (collected code QA)

Benchmarks

  • BIPIA
  • MT-Bench