Small poisoned prompts can make LLMs output attacker-chosen tokens while keeping accuracy nearly intact

Overview

Decision Snapshot: Needs Validation

The attack is demonstrated across three popular models and six datasets with consistently high ASR, but it assumes the attacker can poison the prompt-tuning stage and it relies on a white-box, gradient-based trigger search.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 60%

Authors

Hongwei Yao, Jian Lou, Zhan Qin

Links

Abstract / PDF / Code

Why It Matters For Business

Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

PoisonPrompt is a bi-level optimization method that injects a stealthy backdoor into prompt-based LLM setups (both hard and soft prompts). Using a small poisoned subset (5% in experiments) and a gradient-based trigger search, the attack achieves very high attack success rates (often 95–100% ASR) while dropping clean-task accuracy by under 10%. The paper evaluates three LLMs (BERT, RoBERTa, LLaMA) and three prompting methods across six datasets and shows the attack is robust to trigger size.
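To make the "small poisoned subset" concrete, here is a minimal sketch of how such a subset might be built from a clean training set. The function name, the decision to append the trigger to the end of the text, and the trigger string used in the usage note are assumptions for illustration; only the 5% rate comes from the summary above.

```python
import random

def poison_subset(examples, trigger, target_label, rate=0.05, seed=0):
    """Return a copy of `examples` with about `rate` of them poisoned.

    examples: list of (text, label) pairs.
    A poisoned example gets the trigger appended (an assumption; the
    summary does not say where the trigger is placed) and its label
    replaced by `target_label`.
    """
    rng = random.Random(seed)
    n_poison = round(len(examples) * rate)
    poison_ids = set(rng.sample(range(len(examples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(examples):
        if i in poison_ids:
            poisoned.append((f"{text} {trigger}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```

For example, `poison_subset(train_pairs, "cf", target_label=1)` would leave 95% of `train_pairs` untouched, which is why clean-task accuracy can stay close to that of an unpoisoned prompt.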

Problem Statement

Outsourced or shared prompts can be modified to include backdoors that trigger attacker-chosen outputs. The paper asks whether a prompt-only poisoning attack can (1) reliably force target outputs when a trigger is present and (2) preserve normal accuracy when the trigger is absent.

Main Contribution

PoisonPrompt: a bi-level optimization method to inject backdoors into hard and soft prompts.

A gradient-based discrete trigger search (Hotflip-style) to find short effective triggers.
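The core step of a HotFlip-style search can be sketched in a few lines. This is a generic illustration of the first-order token-swap heuristic, not the authors' implementation; the function name and the plain-list representation of embeddings and gradients are assumptions. In a real setup the gradient would come from backpropagating the backdoor loss to the embedding at the trigger position.

```python
def hotflip_candidates(embedding_matrix, grad_at_position, k=5):
    """Rank vocabulary ids by estimated loss decrease for one trigger slot.

    embedding_matrix: list of vocab_size rows, each a list of dim floats.
    grad_at_position: list of dim floats, the gradient of the loss with
                      respect to the embedding currently in the slot.

    Swapping in token v changes the loss by roughly
    (e_v - e_current) . grad. The e_current term is the same for every
    v, so ranking by e_v . grad alone gives the same order.
    """
    scores = [
        sum(e_i * g_i for e_i, g_i in zip(row, grad_at_position))
        for row in embedding_matrix
    ]
    # Lowest score means the largest estimated decrease in loss.
    ranked = sorted(range(len(scores)), key=lambda v: scores[v])
    return ranked[:k]
```

Because the ranking is only a linear approximation, the usual practice is to re-evaluate the top candidates with a real forward pass and keep the one that actually lowers the loss most.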

Key Findings

Attack success rate (ASR) is very high for poisoned prompts.

Numbers: ASR often 95–100% across datasets and models (Table 1).

Practical Use: Do not use untrusted prompts: an attacker can reliably force outputs on triggered queries.

Evidence Ref: Table 1

Clean-task accuracy drops are small when the prompt is poisoned.

Numbers: Accuracy drop under 10% versus clean prompts (Figure 2).

Practical Use: Backdoors can be stealthy: normal validation tests may not reveal poisoning.

Evidence Ref: Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR	95–100%	—	—	SST2, IMDb, AG News, QQP, QNLI, MNLI (various)	Table 1 reports ASR often at or near 100% across prompt methods and models.	Table 1
ACC drop	<10% drop	clean prompt ACC	—	aggregated across datasets	Figure 2 shows accuracy of backdoored prompts drops by less than 10% versus clean prompts.	Figure 2
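The two metrics in the table can be computed as follows. This is a minimal sketch using the common definitions (ASR as the share of triggered inputs classified as the attacker's target, ACC as plain accuracy on clean inputs); the function names are illustrative, and the paper's exact evaluation protocol may differ, for instance in whether inputs already labeled with the target class are excluded.

```python
def accuracy(predictions, labels):
    """Fraction of clean inputs whose prediction matches the true label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def attack_success_rate(triggered_predictions, target_label):
    """Fraction of triggered inputs predicted as the attacker's target."""
    if not triggered_predictions:
        raise ValueError("need at least one triggered prediction")
    hits = sum(p == target_label for p in triggered_predictions)
    return hits / len(triggered_predictions)

def accuracy_drop(clean_prompt_acc, poisoned_prompt_acc):
    """Accuracy lost on clean inputs when the poisoned prompt is used."""
    return clean_prompt_acc - poisoned_prompt_acc
```

A stealthy backdoor is one where `attack_success_rate` is high while `accuracy_drop` stays small, which is the pattern the table reports.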

What To Try In 7 Days

Audit any third-party prompt on held-out clean and adversarial-triggered inputs.

Do not accept opaque prompt files; require provenance and checksums.

Run simple trigger scans: try common short-token triggers and measure ASR-like behavior changes.
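The trigger scan suggested above could start as something like the sketch below. `predict` is a hypothetical callable that wraps the model together with the prompt under audit and returns a label for one input text; the candidate trigger list and the append-to-end placement are also assumptions.

```python
def trigger_scan(predict, texts, candidate_triggers):
    """Measure how often each candidate trigger changes the prediction.

    predict: hypothetical callable, text -> label, using the audited prompt.
    texts: clean held-out inputs.
    Returns {trigger: fraction of texts whose label changed}.
    """
    if not texts:
        raise ValueError("need at least one text to scan")
    baseline = [predict(t) for t in texts]
    report = {}
    for trigger in candidate_triggers:
        triggered = [predict(f"{t} {trigger}") for t in texts]
        flipped = sum(b != p for b, p in zip(baseline, triggered))
        report[trigger] = flipped / len(texts)
    return report
```

A trigger that flips nearly every prediction to the same label is a strong warning sign. A clean scan is weaker evidence: triggers found by gradient search need not appear in any list of common tokens, so this check can miss a real backdoor.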

Optimization Features

Training Optimization

bi-level optimization for joint prompt and trigger training
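For readers unfamiliar with the term, a bi-level problem of this kind can be written generically as below. This is a template under assumed notation, not the paper's exact objective: $t$ is the trigger, $\theta$ the prompt parameters, $\mathcal{L}_{\mathrm{bd}}$ a backdoor loss, $\mathcal{L}_{\mathrm{task}}$ the ordinary prompt-tuning loss, and $\mathcal{D}_{\mathrm{poison}}(t)$ the poisoned subset built with trigger $t$.

```latex
\begin{aligned}
\min_{t}\quad & \mathcal{L}_{\mathrm{bd}}\bigl(\theta^{*}(t),\, t;\ \mathcal{D}_{\mathrm{poison}}(t)\bigr) \\
\text{s.t.}\quad & \theta^{*}(t) = \arg\min_{\theta}\ \mathcal{L}_{\mathrm{task}}\bigl(\theta;\ \mathcal{D}_{\mathrm{clean}} \cup \mathcal{D}_{\mathrm{poison}}(t)\bigr)
\end{aligned}
```

The inner problem tunes the prompt on the mixed data; the outer problem picks the trigger that makes the resulting prompt most reliably produce the target output. In practice the two are usually alternated rather than solved exactly.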

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/grasses/PoisonPrompt

Risks & Boundaries

Limitations

Assumes ability to insert poisoned prompts during prompt tuning (threat model: prompt training access).

Requires white-box access for gradient-based trigger search.

When Not To Use

When prompts come from a trusted, verified source with integrity checks.

When you cannot modify or supply prompts to the model (no prompt-tuning stage).
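An integrity check of the kind mentioned above can be as simple as comparing a prompt file's SHA-256 digest against a value published by the source. The sketch below uses only the Python standard library; the function names and the idea that the publisher distributes a hex digest are assumptions.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large prompt checkpoints do not fill memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_prompt_file(path, expected_sha256):
    """True if the file's digest matches the published hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A matching checksum only shows the file is the one the publisher released. It says nothing about whether the publisher's prompt was clean in the first place, so it complements behavioral audits rather than replacing them.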

Failure Modes

Input sanitization or filtering could remove or neutralize triggers and block the attack.

Adversarial training or prompt watermarking may reduce ASR.

Core Entities

Models

bert-large-cased, roberta-large, llama-7b

Metrics

ACC, ASR

Datasets

SST2, IMDb, AG News, QQP, QNLI, MNLI

Benchmarks

GLUE

