Overview
The pipeline is practical and low-cost (example: SAP200 generated in ~35 hours and ~$10 API cost). Results are strong on tested models but are limited to the model families and evaluator used.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.
Summary TLDR
The paper introduces two practical frameworks. The attack framework seeds a small set of human-written jailbreak prompts and uses an LLM (gpt-3.5-turbo) with in-context learning to expand them into large, high-quality attack prompt sets (SAP datasets). The defense framework uses those attacks in an iterative fine-tuning loop (instruction tuning with LoRA) so models learn to refuse harmful requests. Experiments show SAP30 produces much higher harmfulness scores than prior automatic or manual sets, fine-tuning can reduce harmful outputs to near-zero on tested Alpaca-LoRA models, and task performance on standard benchmarks stays intact. Code and SAP datasets are released.
Problem Statement
LLMs can be induced to produce harmful content. Creating large, high-quality red-team prompt sets by hand is slow and costly. Fully automatic prompt generators scale but often produce low-quality attacks. We need a low-cost way to produce many realistic attack prompts and a practical defense loop that improves model safety without breaking regular capabilities.
Main Contribution
A semi-automatic attack framework that combines a few human-written prompts with in-context learning on gpt-3.5-turbo to cheaply generate many high-quality attack prompts.
An iterative defense framework that fine-tunes target LLMs on generated attack prompts (instruction tuning with LoRA) and re-expands hard prompts to avoid overfitting.
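The iterative defense loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `respond`, `score`, `fine_tune`, and `expand` are hypothetical callables that you would supply (the target model, an LLM judge, a LoRA instruction-tuning step, and an in-context expansion step), and the `threshold` and `max_rounds` defaults are assumed values.

```python
from typing import Callable, List, Tuple


def iterative_defense(
    attack_prompts: List[str],
    respond: Callable[[str], str],           # hypothetical: queries the target model
    score: Callable[[str, str], float],      # hypothetical: judge's harmfulness score
    fine_tune: Callable[[List[str]], None],  # hypothetical: e.g. a LoRA tuning step
    expand: Callable[[List[str]], List[str]],  # hypothetical: in-context expansion
    threshold: float = 5.0,                  # assumed cutoff for a "hard" prompt
    max_rounds: int = 3,                     # assumed iteration budget
) -> List[Tuple[int, float]]:
    """Return (round, mean harmfulness score) for each evaluation round."""
    history: List[Tuple[int, float]] = []
    prompts = list(attack_prompts)
    for round_idx in range(max_rounds):
        if not prompts:
            break
        # Evaluate the current model on the current attack set.
        scored = [(p, score(p, respond(p))) for p in prompts]
        mean_score = sum(s for _, s in scored) / len(scored)
        history.append((round_idx, mean_score))
        # Prompts that still elicit harmful output are the "hard" ones.
        hard = [p for p, s in scored if s >= threshold]
        if not hard:
            break
        # Fine-tune on the hard prompts, then re-expand them so the next
        # round tests new variants rather than the memorized originals.
        fine_tune(hard)
        prompts = expand(hard)
    return history
```

Passing the four steps in as callables keeps the loop independent of any particular model, judge, or training library, so each piece can be swapped or mocked in tests.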
Key Findings
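The SAP30 attack set is far more effective than prior sets on the evaluated LLMs: on gpt-3.5-turbo it averages a harmfulness score of 8.70, versus 5.41 for Dual-Use and 0.63 for BAD+ (Table 1).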
Models without safety fine-tuning are easier to attack.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average harmfulness score (higher=more harmful) | SAP30 on gpt-3.5-turbo = 8.70; Dual-Use = 5.41; BAD+ = 0.63 | Dual-Use, BAD+ | SAP30 +3.29 vs Dual-Use; +8.07 vs BAD+ on gpt-3.5 | Table 1 across datasets | Table 1 shows SAP30 yields substantially higher harmfulness scores than prior sets | Table 1 |
| Average harmfulness score on Alpaca-LoRA-7B (higher=more harmful) | SAP30 = 8.80 | Dual-Use = 6.63 | +2.17 | Table 1 | Table 1 shows SAP30 outperforms Dual-Use on Alpaca-LoRA-7B | Table 1 |
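The deltas in the table are plain differences between the SAP30 score and each baseline score on the same model. The short check below reproduces them from the values quoted in the table; the dictionary layout and the `deltas` helper are illustrative only.

```python
# Average harmfulness scores as quoted in the Results table (Table 1).
scores = {
    "gpt-3.5-turbo": {"SAP30": 8.70, "Dual-Use": 5.41, "BAD+": 0.63},
    "Alpaca-LoRA-7B": {"SAP30": 8.80, "Dual-Use": 6.63},
}


def deltas(model_scores, reference="SAP30"):
    """Difference between the reference set's score and each other set's score."""
    ref = model_scores[reference]
    return {
        name: round(ref - value, 2)
        for name, value in model_scores.items()
        if name != reference
    }


for model, model_scores in scores.items():
    print(model, deltas(model_scores))
```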
What To Try In 7 Days
Run SAP30 (or the public SAP20) against your model and evaluate outputs with a strong LLM judge (gpt-3.5) to find weak spots.
Fine-tune a small LoRA adapter on a handful of hard prompts (SAP5), then re-evaluate and iteratively expand the cases that still succeed.
Integrate automatic attack-generation + evaluation into your safety CI to catch regressions before release.
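For the CI step, one simple option is to compare the mean harmfulness score of the current build against a stored baseline and fail the job when it rises. The sketch below assumes the judge scores have already been collected; `check_regression`, `baseline_mean`, and `tolerance` are hypothetical names and values, not something the paper prescribes.

```python
from statistics import mean
from typing import Sequence, Tuple


def check_regression(
    scores: Sequence[float],
    baseline_mean: float,
    tolerance: float = 0.5,  # assumed slack for judge noise
) -> Tuple[bool, float]:
    """Return (passed, current_mean) for a batch of harmfulness scores.

    The check passes when the current mean does not exceed the stored
    baseline by more than `tolerance`.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    current = mean(scores)
    return current <= baseline_mean + tolerance, current


if __name__ == "__main__":
    passed, current = check_regression([0.0, 1.0, 0.5], baseline_mean=0.5)
    print("pass" if passed else "fail", current)
```

In a CI job, a failed check would typically exit with a non-zero status so the release is blocked until someone reviews the flagged outputs.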
Risks & Boundaries
Limitations
Defense experiments focus mainly on Alpaca series; broader model families not tested.
Evaluator is gpt-3.5-turbo; it outperforms Perspective API but can misjudge outlier responses (Appendix D, Limitations).
When Not To Use
If you cannot legally or technically fine-tune the target model (closed API without fine-tune access).
As the sole safeguard for production safety; automated defenses still need independent human review.
Failure Modes
Overfitting: after many iterations, the model may emit a 'refuse to answer' response and then continue with unexpected harmful text (Appendix A).
Evaluator blind spots: gpt-3.5 may misclassify rare or cleverly obfuscated harmful outputs.
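Both failure modes suggest routing some outputs to human review rather than trusting the automated judge alone. As one rough heuristic for the overfitting case, the sketch below flags responses that open with a refusal but keep going well past the length of a typical refusal. The marker phrases and the length cutoff are assumptions chosen for illustration, not values from the paper, and a flag here only means "look at this", not "this is harmful".

```python
# Hypothetical refusal openers; tune these to your model's actual phrasing.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "i am unable")


def refusal_then_continuation(response: str, max_refusal_chars: int = 200) -> bool:
    """Flag a response that starts like a refusal but is suspiciously long.

    A plain refusal is usually short; a refusal opener followed by much
    more text is worth a human look.
    """
    text = response.strip().lower()
    starts_with_refusal = text.startswith(REFUSAL_MARKERS)
    return starts_with_refusal and len(text) > max_refusal_chars


def select_for_review(responses):
    """Return the subset of responses that the heuristic flags."""
    return [r for r in responses if refusal_then_continuation(r)]
```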

