OET: a modular toolkit that generates optimization-based adversarial prompts and benchmarks defenses

May 1, 20256 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Jinsheng Pan, Xiaogeng Liu, Chaowei Xiao

Links

Abstract / PDF

Why It Matters For Business

Adaptive, optimization-driven prompt injections can bypass some defenses and expose sensitive outputs, so firms must test deployed LLMs (especially open-source ones) with rigorous red-teaming before production.

Summary TLDR

OET is an open, modular toolkit for building and running optimization-driven prompt-injection attacks and measuring defenses. It converts QA data, trains adversarial strings (white-box or black-box), injects them at test time, and reports Attack Success Rate (ASR). Experiments on 8 QA datasets show open-source models (e.g., Qwen2-7B-Instruct) have very high ASR (≥0.93–0.99), closed-source models (GPT-4o-mini, Claude-3.5) show much lower ASR (≈0.01–0.29), and recent defenses (StruQ, SecAlign) give inconsistent protection across domains. Code is public on GitHub.

Problem Statement

Existing prompt-injection benchmarks are static and cannot produce adaptive, optimization-based attacks that reveal worst-case failures. Practitioners need a flexible testbed that trains adversarial prompt strings, runs transferable attacks across models and domains, and reports consistent metrics for red-teaming and defense comparison.

Main Contribution

OET: a modular, extensible toolkit that trains and deploys optimization-based adversarial strings for prompt injection evaluation.

Curated multi-domain QA collection (law, finance, science, math, medical, code/email/table) standardized for attack/defense testing.

Extensive experiments showing open-source models are generally more vulnerable and current defenses work inconsistently across datasets.

Key Findings

Open-source models are substantially easier to coerce than the closed-source models tested.

NumbersQwen2-7B-Instruct ASR 0.93–0.99 across tasks; GPT-4o-mini ASR 0.01–0.03

Published defense methods reduce ASR unevenly and can make some domains worse.

NumbersStruQ ASR 0.0 on many sets but +0.43 on TriviaQA; SecAlign increases ASR by +0.46 and +0.59 on AQuA and PubMedQA

Attack algorithm choice matters: some optimizers transfer much better than others.

NumbersUAT ASR up to 0.78 (SciQA) and 0.51 (BIPIA); AutoDAN and PEZ often near 0 ASR on SecAlign

Results

Attack Success Rate (ASR) — open vs closed

ValueQwen2-7B-Instruct ASR 0.93–0.99; LLama3.1-8B 0.68–0.95; GPT-4o-mini 0.01–0.03

Baselineclosed-source models

Defense effect (ASR) across domains

ValueStruQ: ASR 0.0 on many datasets but +0.43 on TriviaQA; SecAlign: ASR increases +0.46 (AQuA), +0.59 (PubMedQA)

BaselineBase undefended LLaMA

Attack method performance on SecAlign (ASR)

ValueGCG: 0.16–0.59; UAT: up to 0.78; AutoDAN/PEZ: often ≈0

BaselineSecAlign defended model

Who Should Care

What To Try In 7 Days

Run OET against your deployed model on a small representative QA set and measure ASR.

Test multiple attack families (GCG, UAT, LLM-as-optimizer) to find weakest spots.

Compare ASR before and after any input-sanitization or finetuning defense to spot regressions.

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation focuses on a single attack objective ('Print sql injection') which may not generalize to other goals.
  • Training used very small per-domain training sets (mostly 5 examples), which limits realism of some adaptive attacks.
  • Datasets are QA-only; non-QA apps (dialogs, code generation, multimodal) are not evaluated here.

When Not To Use

  • As a claim of real-world safety guarantees — OET finds weaknesses but does not certify defenses.
  • For non-QA tasks without adapting the conversion and attack pipeline.
  • If you need a fully black-box, low-query attack budget evaluation — some optimizers assume gradient or iterative access.

Failure Modes

  • Attack transferability may drop outside the tested QA domains or with different prompt formats.
  • A defense tuned to the toolkit’s attack families might overfit and still fail on unseen optimization methods.
  • ASR depends on the chosen attack objective and optimizer hyperparameters; different settings can change outcomes.

Core Entities

Models

  • GPT-4o-mini
  • Claude-3.5-sonnet
  • LLama3.1-8B
  • Vicuna-7B
  • Qwen2-7B-Instruct
  • LLaMA

Metrics

  • ASR

Datasets

  • BIPIA
  • SQuAD
  • CaseHold
  • FinQA
  • SciQ
  • TriviaQA
  • AQuA
  • PubMedQA