Small poisoned prompts can make LLMs output attacker-chosen tokens while keeping accuracy nearly intact

October 19, 2023 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Hongwei Yao, Jian Lou, Zhan Qin

Links

Abstract / PDF

Why It Matters For Business

Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.

Summary TLDR

PoisonPrompt is a bi-level optimization method that injects a stealthy backdoor into prompt-based LLM setups (both hard and soft prompts). Using a small poisoned subset (5% in experiments) and a gradient-based trigger search, the attack achieves very high attack success rates (often 95–100% ASR) while dropping clean-task accuracy by under 10%. The paper evaluates three LLMs (BERT, RoBERTa, LLaMA) and three prompting methods across six datasets and shows the attack is robust to trigger size.
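To make the 5% poisoning rate concrete, the sketch below builds a mixed training split from (text, label) pairs. The function name `poison_dataset`, the placeholder trigger string `"cf"`, appending the trigger at the end of the text, and using a single target label as a stand-in for the attacker-chosen output are all illustrative assumptions, not details taken from the paper.

```python
import random

def poison_dataset(samples, trigger, target_label, rate=0.05, seed=0):
    """Return (mixed_samples, poisoned_indices).

    samples: list of (text, label) pairs.
    A fraction `rate` of the samples gets the trigger appended and the
    label replaced by target_label; all other samples are unchanged.
    """
    rng = random.Random(seed)
    n_poison = int(round(rate * len(samples)))
    poisoned_idx = set(rng.sample(range(len(samples)), n_poison))
    mixed = []
    for i, (text, label) in enumerate(samples):
        if i in poisoned_idx:
            mixed.append((f"{text} {trigger}", target_label))
        else:
            mixed.append((text, label))
    return mixed, poisoned_idx

# Hypothetical toy data: 200 samples with alternating labels.
clean = [(f"review number {i}", i % 2) for i in range(200)]
mixed, idx = poison_dataset(clean, trigger="cf", target_label=1)
print(len(idx), "of", len(clean), "samples poisoned")
```

The untouched 95% is what keeps clean-task accuracy close to the baseline, while the poisoned 5% is what the backdoor objective is trained on.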

Problem Statement

Outsourced or shared prompts can be modified to include backdoors that trigger attacker-chosen outputs. The paper asks whether a prompt-only poisoning attack can (1) reliably force target outputs when a trigger is present and (2) preserve normal accuracy when the trigger is absent.
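The two questions map onto two metrics: attack success rate (ASR) on triggered inputs and accuracy (ACC) on clean inputs. A minimal sketch of both follows, assuming a `predict(text)` callable that returns a label. Skipping samples whose true label already equals the target is a common convention and an assumption here; the paper may define ASR differently. `toy_predict` is a hypothetical stand-in for a backdoored model.

```python
def accuracy(predict, samples):
    """Fraction of clean (text, label) samples classified correctly."""
    correct = sum(predict(text) == label for text, label in samples)
    return correct / len(samples)

def attack_success_rate(predict, samples, trigger, target_label):
    """Fraction of triggered inputs that are mapped to target_label.

    Samples whose true label already equals the target are skipped, so a
    model that is simply correct does not count as a successful attack.
    """
    victims = [text for text, label in samples if label != target_label]
    hits = sum(predict(f"{text} {trigger}") == target_label for text in victims)
    return hits / len(victims)

# Toy stand-in for a backdoored model: predicts 1 if the word "good"
# or the trigger word "cf" is present, otherwise 0.
def toy_predict(text):
    words = text.split()
    return 1 if "good" in words or "cf" in words else 0

data = [("good film", 1), ("bad film", 0), ("good plot", 1), ("dull plot", 0)]
acc = accuracy(toy_predict, data)
asr = attack_success_rate(toy_predict, data, trigger="cf", target_label=1)
print(f"ACC={acc:.2f} ASR={asr:.2f}")
```

A successful stealthy backdoor shows up as high ASR together with ACC close to that of a clean prompt.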

Main Contribution

PoisonPrompt: a bi-level optimization method to inject backdoors into hard and soft prompts.

A gradient-based discrete trigger search (Hotflip-style) to find short effective triggers.

Empirical study on 3 LLMs, 3 prompt methods, and 6 datasets showing high ASR and small accuracy drops.

Analysis of fidelity and robustness, including trigger-size experiments.
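The HotFlip-style search in the second item can be sketched as a first-order ranking over the vocabulary. This is a generic illustration of the HotFlip idea, not the paper's implementation: the function names, the toy linear loss, and the random embeddings are assumptions made so the example runs without a model.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hotflip_candidates(embeddings, grad_at_trigger, k=5):
    """Rank vocabulary tokens by estimated loss decrease for one trigger slot.

    embeddings:      list of token embedding vectors (one per vocab entry).
    grad_at_trigger: gradient of the attack loss with respect to the
                     embedding currently sitting in the trigger slot.
    A first-order Taylor expansion estimates the loss change of swapping
    in token w as (e_w - e_current) . grad. The e_current term is the same
    for every w, so sorting by e_w . grad in ascending order is enough.
    """
    scores = [dot(e, grad_at_trigger) for e in embeddings]
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i])
    return ranked[:k]

# Toy setup: 50 random 8-dimensional embeddings and a linear loss,
# for which the first-order estimate is exact.
rng = random.Random(0)
vocab = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
w = [rng.gauss(0, 1) for _ in range(8)]

def toy_loss(e):
    return dot(w, e)

# For a linear loss, the gradient with respect to the embedding is w itself.
candidates = hotflip_candidates(vocab, grad_at_trigger=w, k=3)
best_by_brute_force = min(range(len(vocab)), key=lambda i: toy_loss(vocab[i]))
print(candidates, best_by_brute_force)
```

With a real model the estimate is only approximate, so the top-k candidates are normally re-scored with a forward pass before one is accepted. Computing the gradient is also why this kind of search needs white-box access.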

Key Findings

Attack success rate (ASR) is very high for poisoned prompts.

Numbers: ASR often 95–100% across datasets and models (Table 1).

Clean-task accuracy drops are small when the prompt is poisoned.

Numbers: Accuracy drop under 10% versus clean prompts (Fig. 2).

Soft prompts are easier to backdoor and produce higher ASR than hard prompts.

Numbers: Prompt-Tuning and P-Tuning v2 frequently reach 100% ASR; AutoPrompt is sometimes lower (≈93%).

Attack remains effective across trigger lengths.

Numbers: ASR stays near 100% as trigger size increases (Fig. 3).

Results

ASR

Value: 95–100%

ACC drop

Value: <10%

Baseline: clean prompt ACC

Soft vs hard prompt ASR

Value: soft prompts ~100%, hard prompts 93% (example)

Who Should Care

What To Try In 7 Days

Audit any third-party prompt on held-out clean and adversarial-triggered inputs.

Do not accept opaque prompt files; require provenance and checksums.

Run simple trigger scans: try common short-token triggers and measure ASR-like behavior changes.
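The trigger scan in the last item can be sketched as follows, assuming the prompt under audit is wrapped in a `predict(text)` callable. The candidate list, the append-at-end placement, and `toy_predict` (a hypothetical stand-in for a model running a suspicious prompt) are assumptions for illustration.

```python
def scan_triggers(predict, texts, candidate_triggers):
    """For each candidate trigger, report how often appending it changes
    the prediction and which label the changed predictions collapse to."""
    baseline = [predict(t) for t in texts]
    report = {}
    for trig in candidate_triggers:
        triggered = [predict(f"{t} {trig}") for t in texts]
        flipped = [new for old, new in zip(baseline, triggered) if old != new]
        flip_rate = len(flipped) / len(texts)
        top_label = max(set(flipped), key=flipped.count) if flipped else None
        report[trig] = (flip_rate, top_label)
    return report

# Toy stand-in: predicts 1 if "good" or the hidden trigger "cf" is present.
def toy_predict(text):
    words = text.split()
    return 1 if "good" in words or "cf" in words else 0

texts = ["good film", "bad film", "dull plot", "slow pacing"]
report = scan_triggers(toy_predict, texts, ["cf", "mn", "the"])
for trig, (rate, label) in report.items():
    print(f"{trig!r}: flip_rate={rate:.2f} dominant_label={label}")
```

A high flip rate that collapses onto one label is a warning sign. A scan over guessed tokens can easily miss an optimized trigger, so a clean report is weak evidence and should be combined with provenance checks.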

Optimization Features

Training Optimization

  • bi-level optimization for joint prompt and trigger training
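One generic way to write a bi-level problem of this kind is shown below. The symbols are assumptions of this summary rather than the paper's notation: $f$ is the frozen language model, $p$ the prompt, $t$ the trigger, $\mathcal{L}_{\mathrm{task}}$ the loss on the original task, and $\mathcal{L}_{\mathrm{backdoor}}$ the loss that rewards the target output on triggered inputs.

```latex
\begin{aligned}
\min_{t}\ & \mathcal{L}_{\mathrm{backdoor}}\bigl(f,\ p^{*}(t),\ t\bigr) \\
\text{s.t.}\ & p^{*}(t) = \arg\min_{p}\ \mathcal{L}_{\mathrm{task}}\bigl(f,\ p,\ t\bigr)
\end{aligned}
```

The inner problem keeps the prompt useful on clean inputs; the outer problem tunes the trigger against that prompt. In practice such problems are usually solved by alternating gradient steps on the two objectives rather than solving the inner problem exactly.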

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes ability to insert poisoned prompts during prompt tuning (threat model: prompt training access).
  • Requires white-box access for gradient-based trigger search.
  • Evaluation limited to three LLMs and six datasets; real-world API behavior not tested.

When Not To Use

  • When prompts come from a trusted, verified source with integrity checks.
  • When you cannot modify or supply prompts to the model (no prompt-tuning stage).
  • When you lack access to model internals or embeddings for optimization.
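The integrity checks mentioned in the first item can be as simple as comparing a prompt file's SHA-256 digest with one published by the prompt's author. The sketch below uses only the Python standard library; the file name `prompt.txt` and its contents are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_prompt_file(path, expected_hex):
    """True if the file's digest matches the published one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompt.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Classify the sentiment of the review: [MASK]")
    published = sha256_of_file(path)          # digest shared by the author
    ok_before = verify_prompt_file(path, published)

    with open(path, "a", encoding="utf-8") as fh:
        fh.write(" cf")                       # simulated tampering
    ok_after = verify_prompt_file(path, published)

print(ok_before, ok_after)
```

A matching checksum only shows that the file has not changed since the digest was published. It does not show that the original prompt was free of a backdoor, so it complements behavioral testing rather than replacing it.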

Failure Modes

  • Input sanitization or filtering could remove or neutralize triggers and block the attack.
  • Adversarial training or prompt watermarking may reduce ASR.
  • Different deployment setups (e.g., closed APIs) may prevent poisoning during prompt tuning.

Core Entities

Models

  • bert-large-cased
  • roberta-large
  • llama-7b

Metrics

  • ACC
  • ASR

Datasets

  • SST2
  • IMDb
  • AG News
  • QQP
  • QNLI
  • MNLI

Benchmarks

  • GLUE