Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Adaptive, optimization-driven prompt injections can bypass some defenses and expose sensitive outputs, so firms must test deployed LLMs (especially open-source ones) with rigorous red-teaming before production.
Summary TLDR
OET is an open, modular toolkit for building and running optimization-driven prompt-injection attacks and measuring defenses. It converts QA data, trains adversarial strings (white-box or black-box), injects them at test time, and reports Attack Success Rate (ASR). Experiments on 8 QA datasets show open-source models (e.g., Qwen2-7B-Instruct) have very high ASR (≥0.93–0.99), closed-source models (GPT-4o-mini, Claude-3.5) show much lower ASR (≈0.01–0.29), and recent defenses (StruQ, SecAlign) give inconsistent protection across domains. Code is public on GitHub.
Problem Statement
Existing prompt-injection benchmarks are static and cannot produce adaptive, optimization-based attacks that reveal worst-case failures. Practitioners need a flexible testbed that trains adversarial prompt strings, runs transferable attacks across models and domains, and reports consistent metrics for red-teaming and defense comparison.
Main Contribution
OET: a modular, extensible toolkit that trains and deploys optimization-based adversarial strings for prompt injection evaluation.
Curated multi-domain QA collection (law, finance, science, math, medical, code/email/table) standardized for attack/defense testing.
Extensive experiments showing open-source models are generally more vulnerable and current defenses work inconsistently across datasets.
Key Findings
Open-source models are substantially easier to coerce than the closed-source models tested.
Published defense methods reduce ASR unevenly and can make some domains worse.
Attack algorithm choice matters: some optimizers transfer much better than others.
Results
Attack Success Rate (ASR) — open vs closed
Defense effect (ASR) across domains
Attack method performance on SecAlign (ASR)
Who Should Care
What To Try In 7 Days
Run OET against your deployed model on a small representative QA set and measure ASR.
Test multiple attack families (GCG, UAT, LLM-as-optimizer) to find weakest spots.
Compare ASR before and after any input-sanitization or finetuning defense to spot regressions.
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation focuses on a single attack objective ('Print sql injection') which may not generalize to other goals.
- Training used very small per-domain training sets (mostly 5 examples), which limits realism of some adaptive attacks.
- Datasets are QA-only; non-QA apps (dialogs, code generation, multimodal) are not evaluated here.
When Not To Use
- As a claim of real-world safety guarantees — OET finds weaknesses but does not certify defenses.
- For non-QA tasks without adapting the conversion and attack pipeline.
- If you need a fully black-box, low-query attack budget evaluation — some optimizers assume gradient or iterative access.
Failure Modes
- Attack transferability may drop outside the tested QA domains or with different prompt formats.
- A defense tuned to the toolkit’s attack families might overfit and still fail on unseen optimization methods.
- ASR depends on the chosen attack objective and optimizer hyperparameters; different settings can change outcomes.
Core Entities
Models
- GPT-4o-mini
- Claude-3.5-sonnet
- LLama3.1-8B
- Vicuna-7B
- Qwen2-7B-Instruct
- LLaMA
Metrics
- ASR
Datasets
- BIPIA
- SQuAD
- CaseHold
- FinQA
- SciQ
- TriviaQA
- AQuA
- PubMedQA

