Small poisoned prompts can make LLMs output attacker-chosen tokens while keeping accuracy nearly intact

October 19, 20236 min

Overview

Decision SnapshotNeeds Validation

The attack is demonstrated across three popular models and six datasets with consistent high ASR, but it assumes ability to poison prompt training and uses white-box gradient-based search.

Citations1

Evidence Strength0.80

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 60%

Authors

Hongwei Yao, Jian Lou, Zhan Qin

Links

Abstract / PDF / Code

Why It Matters For Business

Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.

Who Should Care

Summary TLDR

PoisonPrompt is a bi-level optimization method that injects a stealthy backdoor into prompt-based LLM setups (both hard and soft prompts). Using a small poisoned subset (5% in experiments) and a gradient-based trigger search, the attack achieves very high attack success rates (often 95–100% ASR) while dropping clean-task accuracy by under 10%. The paper evaluates three LLMs (BERT, RoBERTa, LLaMA) and three prompting methods across six datasets and shows the attack is robust to trigger size.

Problem Statement

Outsourced or shared prompts can be modified to include backdoors that trigger attacker-chosen outputs. The paper asks whether a prompt-only poisoning attack can (1) reliably force target outputs when a trigger is present and (2) preserve normal accuracy when the trigger is absent.

Main Contribution

PoisonPrompt: a bi-level optimization method to inject backdoors into hard and soft prompts.

A gradient-based discrete trigger search (Hotflip-style) to find short effective triggers.

Key Findings

Attack success rate (ASR) is very high for poisoned prompts.

NumbersASR often 95100% across datasets and models (Table 1).

Practical UseDo not use untrusted prompts: an attacker can reliably force outputs on triggered queries.

Evidence RefTable 1

Clean-task accuracy drops are small when the prompt is poisoned.

NumbersAccuracy drop under 10% versus clean prompts (Fig.2).

Practical UseBackdoors can be stealthy: normal validation tests may not reveal poisoning.

Evidence RefFigure 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
ASR95100%SST2, IMDb, AG News, QQP, QNLI, MNLI (various)Table 1 reports ASR often at or near 100% across prompt methods and models.Table 1
ACC drop<10% dropclean prompt ACCaggregated across datasetsFigure 2 shows accuracy of backdoored prompts drops by less than 10% versus clean prompts.Figure 2

What To Try In 7 Days

Audit any third-party prompt on held-out clean and adversarial-triggered inputs.

Do not accept opaque prompt files; require provenance and checksums.

Run simple trigger scans: try common short-token triggers and measure ASR-like behavior changes.

Optimization Features

Training Optimization
bi-level optimization for joint prompt and trigger training

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Assumes ability to insert poisoned prompts during prompt tuning (threat model: prompt training access).

Requires white-box access for gradient-based trigger search.

When Not To Use

When prompts come from a trusted, verified source with integrity checks.

When you cannot modify or supply prompts to the model (no prompt-tuning stage).

Failure Modes

Input sanitization or filtering could remove or neutralize triggers and block the attack.

Adversarial training or prompt watermarking may reduce ASR.

Core Entities

Models

bert-large-casedroberta-largellama-7b

Metrics

ACCASR

Datasets

SST2IMDbAG NewsQQPQNLIMNLI

Benchmarks

GLUE