Overview
The attack is demonstrated on three popular models and six datasets with consistently high ASR, but it assumes the attacker can poison the prompt-tuning stage and relies on a white-box, gradient-based trigger search.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.
Summary TLDR
PoisonPrompt is a bi-level optimization method that injects a stealthy backdoor into prompt-based LLM pipelines, covering both hard and soft prompts. With a small poisoned subset (5% in the experiments) and a gradient-based trigger search, the attack reaches very high attack success rates (ASR, often 95–100%) while reducing clean-task accuracy by less than 10%. The paper evaluates three language models (BERT, RoBERTa, LLaMA) and three prompting methods across six datasets, and reports that the attack remains effective across trigger sizes.
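To make the "small poisoned subset" concrete, here is a minimal sketch of how such a subset could be built. The summary does not describe the paper's exact construction, so everything here is an assumption for illustration: the function name `poison_subset`, the choice to append the trigger to the end of the text, and the (text, label) data layout are all hypothetical.

```python
import random

def poison_subset(examples, trigger, target_label, rate=0.05, seed=0):
    """Return (new_examples, poisoned_indices).

    Assumption: a poisoned example is the original text with the trigger
    appended and the label replaced by the attacker's target label.
    `examples` is a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    n_poison = int(round(rate * len(examples)))
    poisoned_idx = set(rng.sample(range(len(examples)), n_poison))

    out = []
    for i, (text, label) in enumerate(examples):
        if i in poisoned_idx:
            out.append((f"{text} {trigger}", target_label))  # triggered + relabeled
        else:
            out.append((text, label))  # left untouched
    return out, poisoned_idx
```

With 100 examples and the default rate, 5 examples are modified and the remaining 95 stay clean, which is what lets the prompt keep its normal behavior on untriggered inputs.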
Problem Statement
Outsourced or shared prompts can be modified to include backdoors that trigger attacker-chosen outputs. The paper asks whether a prompt-only poisoning attack can (1) reliably force target outputs when a trigger is present and (2) preserve normal accuracy when the trigger is absent.
Main Contribution
PoisonPrompt: a bi-level optimization method to inject backdoors into hard and soft prompts.
A gradient-based discrete trigger search (Hotflip-style) to find short effective triggers.
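The core step of a HotFlip-style search can be sketched as follows. This is a generic illustration of the first-order token-swap heuristic, not the paper's implementation: the function name `hotflip_candidates` and the plain-list inputs are assumptions, and in practice the gradient would come from backpropagating the backdoor loss through the white-box model.

```python
def hotflip_candidates(embedding_matrix, grad, k=5):
    """Rank vocabulary tokens as replacements for one trigger position.

    embedding_matrix: list of token embedding vectors, one per vocabulary id.
    grad: gradient of the loss w.r.t. the embedding at the trigger position.

    Swapping in token w changes the loss by roughly (e_w - e_current) . grad.
    The e_current term is the same for every candidate, so ranking by
    -(e_w . grad) picks the tokens with the largest estimated loss decrease.
    """
    scores = [-sum(e * g for e, g in zip(row, grad)) for row in embedding_matrix]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Because the ranking is only a linear approximation, a typical search re-evaluates the true loss for each of the top-k candidates and keeps the best one before moving to the next trigger position.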
Key Findings
Attack success rate (ASR) is often at or near 95–100% for poisoned prompts (Table 1).
Clean-task accuracy drops by less than 10% when the prompt is poisoned, compared with a clean prompt (Figure 2).
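The two metrics behind these findings can be computed as in the sketch below. The exact definitions used in the paper are not given in this summary, so this follows a common convention as an assumption: ASR is measured only on inputs whose true label differs from the target, with the trigger appended. The `predict` callable and the function names are hypothetical.

```python
def clean_accuracy(predict, examples):
    """Fraction of untriggered (text, label) pairs classified correctly."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

def attack_success_rate(predict, examples, trigger, target_label):
    """Fraction of triggered non-target inputs classified as the target.

    Assumption: examples already labeled `target_label` are excluded, since
    predicting the target for them does not demonstrate the backdoor.
    """
    victims = [text for text, label in examples if label != target_label]
    if not victims:
        raise ValueError("no non-target examples to evaluate")
    hits = sum(1 for text in victims if predict(f"{text} {trigger}") == target_label)
    return hits / len(victims)
```

Reporting both numbers for the same prompt, once clean and once poisoned, gives the explicit accuracy delta that the summary's "<10% drop" figure refers to.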
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASR | 95–100% | — | — | SST2, IMDb, AG News, QQP, QNLI, MNLI (various) | Table 1 reports ASR often at or near 100% across prompt methods and models. | Table 1 |
| ACC drop | <10% drop | clean prompt ACC | — | aggregated across datasets | Figure 2 shows accuracy of backdoored prompts drops by less than 10% versus clean prompts. | Figure 2 |
What To Try In 7 Days
Audit any third-party prompt on held-out clean and adversarial-triggered inputs.
Do not accept opaque prompt files; require provenance and checksums.
Run simple trigger scans: try common short-token triggers and measure ASR-like behavior changes.
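A simple trigger scan along these lines could look like the sketch below. It is a hedged illustration rather than a vetted detector: `predict` stands in for whatever wraps the model plus the third-party prompt, and the function name, the append-to-end placement, and the 0.5 threshold are all assumptions.

```python
def scan_triggers(predict, clean_inputs, candidate_triggers, threshold=0.5):
    """Measure how often each candidate trigger flips the model's prediction.

    Returns (report, flagged): report maps trigger -> flip rate, and flagged
    lists the triggers whose flip rate is at or above `threshold`.
    """
    baseline = [predict(x) for x in clean_inputs]
    report = {}
    for trig in candidate_triggers:
        triggered = [predict(f"{x} {trig}") for x in clean_inputs]
        flips = sum(b != t for b, t in zip(baseline, triggered))
        report[trig] = flips / len(clean_inputs)
    flagged = [t for t, rate in report.items() if rate >= threshold]
    return report, flagged
```

A scan like this only covers the candidates it is given. A trigger found by gradient search may not be a common token, so a clean scan result lowers suspicion but does not prove the prompt is safe.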
Risks & Boundaries
Limitations
Assumes ability to insert poisoned prompts during prompt tuning (threat model: prompt training access).
Requires white-box access for gradient-based trigger search.
When Not To Use
When prompts come from a trusted, verified source with integrity checks.
When you cannot modify or supply prompts to the model (no prompt-tuning stage).
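An integrity check for a prompt file can be as small as comparing its SHA-256 digest with one published by the trusted source. The sketch below uses only the Python standard library; the function names and the idea that the provider publishes a hex digest are assumptions for illustration.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=65536):
    """Stream the file so large soft-prompt checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_prompt_file(path, expected_hex):
    """Return True only if the file's digest matches the published one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A matching digest shows only that the file is the one the source published. It says nothing about whether that source's prompt was poisoned before publication, which is why provenance matters alongside the checksum.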
Failure Modes
Input sanitization or filtering could remove or neutralize triggers and block the attack.
Adversarial training or prompt watermarking may reduce ASR.

