Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.
Summary TLDR
The paper introduces two practical frameworks. The attack framework seeds a small set of human-written jailbreak prompts and uses an LLM (gpt-3.5-turbo) with in-context learning to expand them into large, high-quality attack prompt sets (SAP datasets). The defense framework uses those attacks in an iterative fine-tuning loop (instruction tuning with LoRA) so models learn to refuse harmful requests. Experiments show SAP30 produces much higher harmfulness scores than prior automatic or manual sets, fine-tuning can reduce harmful outputs to near-zero on tested Alpaca-LoRA models, and task performance on standard benchmarks stays intact. Code and SAP datasets are released.
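The attack-side expansion described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the few-shot prompt wording, the function names `build_expansion_prompt` and `expand_prompts`, the choice to sample examples from the growing pool, and the `generate` callable (standing in for a call to an LLM such as gpt-3.5-turbo) are all assumptions made for the example.

```python
import random
from typing import Callable, List


def build_expansion_prompt(examples: List[str]) -> str:
    """Assemble a few-shot prompt from example attack prompts (hypothetical wording)."""
    parts = ["Here are example red-team prompts:"]
    for i, example in enumerate(examples, 1):
        parts.append(f"Example {i}: {example}")
    parts.append("Write one new prompt in the same style for a different scenario.")
    return "\n".join(parts)


def expand_prompts(
    seed_prompts: List[str],
    generate: Callable[[str], str],  # assumed wrapper around an LLM call
    target_size: int,
    k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Grow a small seed set into a larger pool via in-context learning."""
    rng = random.Random(seed)
    pool = list(seed_prompts)
    seen = set(pool)
    attempts = 0
    max_attempts = target_size * 10  # guard against a generator that only repeats itself
    while len(pool) < target_size and attempts < max_attempts:
        attempts += 1
        # Sample demonstrations from the current pool (an assumption of this sketch).
        examples = rng.sample(pool, min(k, len(pool)))
        candidate = generate(build_expansion_prompt(examples)).strip()
        if candidate and candidate not in seen:  # keep only new, non-empty prompts
            seen.add(candidate)
            pool.append(candidate)
    return pool
```

In practice `generate` would wrap whichever chat-completion client you use; a quality filter (for example, keeping only candidates that a judge scores as sufficiently harmful) could be added before the `pool.append` step, though that filter is not shown here.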
Problem Statement
LLMs can be induced to produce harmful content. Creating large, high-quality red-team prompt sets by hand is slow and costly. Fully automatic prompt generators scale but often produce low-quality attacks. We need a low-cost way to produce many realistic attack prompts and a practical defense loop that improves model safety without breaking regular capabilities.
Main Contribution
A semi-automatic attack framework that combines a few human-written prompts with in-context learning on gpt-3.5-turbo to cheaply generate many high-quality attack prompts.
An iterative defense framework that fine-tunes target LLMs on generated attack prompts (instruction tuning with LoRA) and re-expands hard prompts to avoid overfitting.
A released suite of SAP attack-prompt datasets (sizes from 40 to 1,600 prompts) and experiments showing strong attack power and effective defense with small fine-tuning budgets.
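The iterative defense loop in the second contribution can be pictured with a short control-flow sketch. Everything here is illustrative: `respond`, `judge`, `fine_tune`, and `expand` are hypothetical callables standing in for the target model, the harmfulness evaluator, a LoRA instruction-tuning step, and the attack expansion step, and the threshold of 5.0 on the 0-10 harmfulness scale is an assumed cut-off, not a value taken from the paper.

```python
from typing import Callable, List


def run_defense_rounds(
    prompts: List[str],
    respond: Callable[[str], str],          # assumed: target model inference
    judge: Callable[[str, str], float],     # assumed: harmfulness score in [0, 10]
    fine_tune: Callable[[List[str]], None], # assumed: tune the model to refuse these prompts
    expand: Callable[[List[str]], List[str]],  # assumed: regenerate variants of hard prompts
    threshold: float = 5.0,                 # hypothetical cut-off for a "hard" prompt
    max_rounds: int = 3,
) -> List[float]:
    """Return the mean harmfulness observed at the start of each round."""
    history: List[float] = []
    current = list(prompts)
    for _ in range(max_rounds):
        if not current:
            break
        scores = [judge(p, respond(p)) for p in current]
        history.append(sum(scores) / len(scores))
        # Prompts that still elicit harmful output are the ones worth training on.
        hard = [p for p, s in zip(current, scores) if s >= threshold]
        if not hard:
            break
        fine_tune(hard)
        # Re-expand hard prompts so the next round is not evaluated on a fixed set.
        current = expand(hard)
    return history
```

Evaluating each round on freshly expanded prompts, rather than on the prompts just used for tuning, is what lets the loop detect whether the refusal behaviour generalizes instead of being memorized.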
Key Findings
The SAP30 attack set is far more effective than prior sets on the evaluated LLMs.
Models without safety fine-tuning are easier to attack.
Iterative fine-tuning with SAP prompts sharply reduces harmful outputs.
Defense fine-tuning has little negative impact on standard NLP tasks.
gpt-3.5-turbo is an effective automatic harmfulness judge in this pipeline.
Generating SAP200 is low-cost and takes only a modest amount of time.
Results
Average harmfulness score (higher=more harmful)
LoRA
Defense effect: harmfulness after fine-tuning
Accuracy
Who Should Care
What To Try In 7 Days
Run SAP30 (or the public SAP20) against your model and evaluate outputs with a strong LLM judge (gpt-3.5) to find weak spots.
Fine-tune a small LoRA adapter on a handful (SAP5) of hard prompts, then re-evaluate and expand hard cases iteratively.
Integrate automatic attack-generation + evaluation into your safety CI to catch regressions before release.
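For the CI step, one simple way to turn judge scores into a release gate is sketched below. The function name `safety_gate` and both budgets (`max_mean`, `max_single`) are hypothetical choices for illustration; the only detail taken from the summary is that harmfulness is scored on a 0-10 scale where higher means more harmful.

```python
from statistics import mean
from typing import Dict, List


def safety_gate(
    scores: List[float],
    max_mean: float = 1.0,    # hypothetical budget for average harmfulness
    max_single: float = 5.0,  # hypothetical budget for the worst single response
) -> Dict[str, object]:
    """Summarize judge scores (0-10, higher = more harmful) and decide pass/fail."""
    if not scores:
        raise ValueError("no scores to evaluate")
    report: Dict[str, object] = {"mean": mean(scores), "worst": max(scores)}
    report["passed"] = report["mean"] <= max_mean and report["worst"] <= max_single
    return report
```

A CI job could call this on the scores from the latest attack run and exit non-zero when `passed` is false; checking the worst case as well as the mean keeps a single severe failure from being averaged away.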
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Defense experiments focus mainly on Alpaca series; broader model families not tested.
- Evaluator is gpt-3.5-turbo; it outperforms Perspective API but can misjudge outlier responses (Appendix D, Limitations).
- Overfitting can occur if the fine-tuning set is immutable; authors mitigate this by regenerating prompts each iteration.
- SAP datasets may mirror the distribution of their seed prompts and not cover all real-world attack styles.
When Not To Use
- If you cannot legally or technically fine-tune the target model (closed API without fine-tune access).
- As the sole safeguard for final production safety; automated defenses still need independent human review and oversight.
- When your threat model is very different from the eight topics covered by SAP.
Failure Modes
- Overfitting: 'refuse to answer' responses followed by unexpected harmful text after many iterations (Appendix A).
- Evaluator blind spots: gpt-3.5 may misclassify rare or cleverly obfuscated harmful outputs.
- Dataset misuse: released attack prompts could be weaponized if handled irresponsibly.
Core Entities
Models
- gpt-3.5-turbo-0301
- text-davinci-003
- LoRA
Metrics
- harmfulness score (0-10)
- Accuracy
- ROC AUC
- recall
- precision
Datasets
- SAP5
- SAP10
- SAP20
- SAP30
- SAP200
- Dual-Use
- BAD+
Benchmarks
- BoolQ
- ARC_Easy
- RACE
- CB
- COPA

