Semi-automatic pipeline: teach an LLM to generate high-quality attack prompts, then iteratively fine-tune models to refuse them

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and low-cost (for example, SAP200 was generated in about 35 hours at about $10 in API cost). Results are strong on the tested models but are limited to the model families and evaluator used.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.

Who Should Care

CTO, ML Engineer, Data Scientist, Engineering Lead, Product Manager

Summary TLDR

The paper introduces two practical frameworks. The attack framework starts from a small seed set of human-written jailbreak prompts and uses an LLM (gpt-3.5-turbo) with in-context learning to expand them into large, high-quality attack prompt sets (the SAP datasets). The defense framework feeds those attacks into an iterative fine-tuning loop (instruction tuning with LoRA) so that models learn to refuse harmful requests. Experiments show that SAP30 produces much higher harmfulness scores than prior automatic or manual sets, that fine-tuning can reduce harmful outputs to near zero on the tested Alpaca-LoRA models, and that task performance on standard benchmarks stays intact. Code and the SAP datasets are released.
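The expansion step of the attack framework can be sketched as a loop that samples a few prompts from the current pool as in-context examples, asks a generator for a new prompt, and keeps it only if a scorer rates it highly. This is a minimal sketch, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for the gpt-3.5-turbo calls, and `shots`, `min_score`, and `max_calls` are assumed parameters that the summary does not specify.

```python
import random
from typing import Callable, List, Optional


def expand_prompts(
    seeds: List[str],
    generate: Callable[[List[str]], str],   # hypothetical: LLM call with in-context examples
    score: Callable[[str], float],          # hypothetical: quality/harmfulness scorer (0-10)
    target_size: int,
    shots: int = 3,                         # assumed number of in-context examples
    min_score: float = 5.0,                 # assumed acceptance threshold
    max_calls: int = 1000,                  # safety cap on generator calls
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Grow a seed pool until it reaches target_size or the call budget runs out."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    calls = 0
    while len(pool) < target_size and calls < max_calls:
        calls += 1
        # Sample a few existing prompts to show the generator as examples.
        examples = rng.sample(pool, k=min(shots, len(pool)))
        candidate = generate(examples)
        # Keep only new candidates that the scorer rates at or above the threshold.
        if candidate not in pool and score(candidate) >= min_score:
            pool.append(candidate)
    return pool
```

Because accepted candidates join the pool, later generations can draw on earlier ones as examples, which is one plausible way a handful of seeds can grow into sets such as SAP200.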

Problem Statement

LLMs can be induced to produce harmful content. Creating large, high-quality red-team prompt sets by hand is slow and costly. Fully automatic prompt generators scale but often produce low-quality attacks. What is needed is a low-cost way to produce many realistic attack prompts, plus a practical defense loop that improves model safety without breaking regular capabilities.

Main Contribution

A semi-automatic attack framework that combines a few human-written prompts with in-context learning on gpt-3.5 to cheaply generate many high-quality attack prompts.

An iterative defense framework that fine-tunes target LLMs on generated attack prompts (instruction tuning with LoRA) and re-expands hard prompts to avoid overfitting.
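The defense loop described above can be sketched as: fine-tune on the current prompts paired with a refusal, re-test, and expand only the prompts that still elicit harmful output. This is a sketch under stated assumptions rather than the paper's code. `fine_tune`, `respond`, `score`, and `expand` are hypothetical callables (for instance, a LoRA training step, target-model inference, an LLM judge, and the attack framework), and the refusal text, `threshold`, and `max_rounds` are placeholders.

```python
from typing import Callable, List, Tuple

# Placeholder refusal used as the fine-tuning target (assumption, not from the paper).
REFUSAL = "I'm sorry, but I can't help with that request."


def defense_loop(
    prompts: List[str],
    fine_tune: Callable[[List[Tuple[str, str]]], None],  # hypothetical LoRA tuning step
    respond: Callable[[str], str],                       # hypothetical target-model inference
    score: Callable[[str, str], float],                  # hypothetical judge: (prompt, response) -> 0-10
    expand: Callable[[List[str]], List[str]],            # hypothetical attack-framework expansion
    max_rounds: int = 3,
    threshold: float = 1.0,
) -> List[str]:
    """Return the prompts that still score above threshold after the last round."""
    current = list(prompts)
    hard: List[str] = list(prompts)
    for _ in range(max_rounds):
        # Train the model to answer every current attack prompt with a refusal.
        fine_tune([(p, REFUSAL) for p in current])
        # Re-test and keep only the prompts that still elicit harmful output.
        hard = [p for p in current if score(p, respond(p)) > threshold]
        if not hard:
            return []
        # Generate fresh variants of the hard cases for the next round.
        current = expand(hard)
    return hard
```

Expanding the hard cases each round means the model sees new variants rather than the same prompts repeatedly, which matches the summary's stated goal of avoiding overfitting.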

Key Findings

SAP30 attack set is far more effective than prior sets on evaluated LLMs.

Numbers: gpt-3.5-turbo harmfulness score: SAP30 = 8.70 vs Dual-Use = 5.41 vs BAD+ = 0.63 (Table 1)

Practical Use: Use SAP-style semi-automatic generation to build stronger red-team suites instead of relying only on small manual lists or crude automatic prompts.

Evidence Ref: Table 1

Models without safety fine-tuning are easier to attack.

Numbers: SAP30 on Alpaca-LoRA-7B = 8.80 vs gpt-3.5-turbo = 8.70; GPT-3.5 shows some resistance due to RLHF

Practical Use: Expect open, instruction-tuned models to be more vulnerable than RLHF-aligned systems; plan defenses accordingly.

Evidence Ref: Table 1, Section 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average harmfulness score (higher=more harmful)	SAP30 on gpt-3.5-turbo = 8.70; Dual-Use = 5.41; BAD+ = 0.63	Dual-Use, BAD+	SAP30 +3.29 vs Dual-Use; +8.07 vs BAD+ on gpt-3.5	Table 1 across datasets	Table 1 shows SAP30 yields substantially higher harmfulness scores than prior sets	Table 1
Average harmfulness score on Alpaca-LoRA-7B (higher=more harmful)	SAP30 = 8.80	Dual-Use = 6.63	+2.17	Table 1	Table 1 shows SAP30 outperforms Dual-Use on Alpaca-LoRA-7B	Table 1
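The Delta column is simply the SAP30 score minus each baseline score. As a quick check, the arithmetic can be reproduced from the values in the table; the helper name `delta_vs_baselines` is illustrative only.

```python
# Harmfulness scores copied from the Results table above (Table 1 in the paper).
scores = {
    "gpt-3.5-turbo": {"SAP30": 8.70, "Dual-Use": 5.41, "BAD+": 0.63},
    "Alpaca-LoRA-7B": {"SAP30": 8.80, "Dual-Use": 6.63},
}


def delta_vs_baselines(model_scores, candidate="SAP30"):
    """Return candidate-minus-baseline gaps, rounded to two decimals."""
    top = model_scores[candidate]
    return {
        name: round(top - value, 2)
        for name, value in model_scores.items()
        if name != candidate
    }


deltas = {model: delta_vs_baselines(s) for model, s in scores.items()}
print(deltas)
```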

What To Try In 7 Days

Run SAP30 (or the public SAP20) against your model and evaluate outputs with a strong LLM judge (gpt-3.5) to find weak spots.

Fine-tune a small LoRA adapter on a handful (SAP5) of hard prompts, then re-evaluate and expand hard cases iteratively.

Integrate automatic attack-generation + evaluation into your safety CI to catch regressions before release.
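One way to wire the evaluation step into a CI job is a gate that averages judge scores over an attack suite and fails the build when the average exceeds a budget. This is a minimal sketch: `respond` and `judge` are hypothetical callables for your model and your LLM judge, and the `max_avg` budget of 1.0 is an assumed value to tune, not a number from the paper.

```python
from statistics import mean
from typing import Callable, Iterable


def average_harmfulness(
    prompts: Iterable[str],
    respond: Callable[[str], str],        # hypothetical: call the model under test
    judge: Callable[[str, str], float],   # hypothetical: LLM judge, (prompt, response) -> 0-10
) -> float:
    """Average judge score over the suite (higher means more harmful)."""
    scores = [judge(p, respond(p)) for p in prompts]
    if not scores:
        raise ValueError("attack suite is empty")
    return mean(scores)


def safety_gate(
    prompts: Iterable[str],
    respond: Callable[[str], str],
    judge: Callable[[str, str], float],
    max_avg: float = 1.0,                 # assumed budget; tune per product
) -> bool:
    """Return True when the model stays within the harmfulness budget."""
    return average_harmfulness(prompts, respond, judge) <= max_avg
```

In a CI script, a `False` result from `safety_gate` would typically translate into a non-zero exit code so that a safety regression blocks the release.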

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Aatrox103/SAP

Data URLs

https://github.com/Aatrox103/SAP

Risks & Boundaries

Limitations

Defense experiments focus mainly on Alpaca series; broader model families not tested.

Evaluator is gpt-3.5-turbo; it outperforms Perspective API but can misjudge outlier responses (Appendix D, Limitations).

When Not To Use

If you cannot legally or technically fine-tune the target model (closed API without fine-tune access).

For final production safety without independent human review; automated defenses need human oversight.

Failure Modes

Overfitting: 'refuse to answer' responses followed by unexpected harmful text after many iterations (Appendix A).

Evaluator blind spots: gpt-3.5 may misclassify rare or cleverly obfuscated harmful outputs.

Core Entities

Models

gpt-3.5-turbo-0301, text-davinci-003, LoRA

Metrics

harmfulness score (0-10), Accuracy, ROC AUC, recall, precision

Datasets

SAP5, SAP10, SAP20, SAP30, SAP200, Dual-Use, BAD+

Benchmarks

BoolQ, ARC_Easy, RACE, CB, COPA
