Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.
Summary TLDR
PoisonPrompt is a bi-level optimization method that injects a stealthy backdoor into prompt-based LLM setups (both hard and soft prompts). Using a small poisoned subset (5% in experiments) and a gradient-based trigger search, the attack achieves very high attack success rates (often 95–100% ASR) while dropping clean-task accuracy by under 10%. The paper evaluates three LLMs (BERT, RoBERTa, LLaMA) and three prompting methods across six datasets and shows the attack is robust to trigger size.
Problem Statement
Outsourced or shared prompts can be modified to include backdoors that trigger attacker-chosen outputs. The paper asks whether a prompt-only poisoning attack can (1) reliably force target outputs when a trigger is present and (2) preserve normal accuracy when the trigger is absent.
Main Contribution
PoisonPrompt: a bi-level optimization method to inject backdoors into hard and soft prompts.
A gradient-based discrete trigger search (Hotflip-style) to find short effective triggers.
Empirical study on 3 LLMs, 3 prompt methods, and 6 datasets showing high ASR and small accuracy drops.
Analysis of fidelity and robustness, including trigger-size experiments.
Key Findings
Attack success rate (ASR) is very high for poisoned prompts.
Clean-task accuracy drops are small when the prompt is poisoned.
Soft prompts are easier to backdoor and produce higher ASR than hard prompts.
Attack remains effective across trigger lengths.
Results
ASR
ACC drop
Soft vs hard prompt ASR
Who Should Care
What To Try In 7 Days
Audit any third-party prompt on held-out clean and adversarial-triggered inputs.
Do not accept opaque prompt files; require provenance and checksums.
Run simple trigger scans: try common short-token triggers and measure ASR-like behavior changes.
Optimization Features
Training Optimization
- bi-level optimization for joint prompt and trigger training
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Assumes ability to insert poisoned prompts during prompt tuning (threat model: prompt training access).
- Requires white-box access for gradient-based trigger search.
- Evaluation limited to three LLMs and six datasets; real-world API behavior not tested.
When Not To Use
- When prompts come from a trusted, verified source with integrity checks.
- When you cannot modify or supply prompts to the model (no prompt-tuning stage).
- When you lack access to model internals or embeddings for optimization.
Failure Modes
- Input sanitization or filtering could remove or neutralize triggers and block the attack.
- Adversarial training or prompt watermarking may reduce ASR.
- Different deployment setups (e.g., closed APIs) may prevent poisoning during prompt tuning.
Core Entities
Models
- bert-large-cased
- roberta-large
- llama-7b
Metrics
- ACC
- ASR
Datasets
- SST2
- IMDb
- AG News
- QQP
- QNLI
- MNLI
Benchmarks
- GLUE

