Semi-automatic pipeline: teach an LLM to generate high-quality attack prompts, then iteratively fine-tune models to refuse them

October 19, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and low-cost (example: SAP200 generated in ~35 hours and ~$10 API cost). Results are strong on tested models but are limited to the model families and evaluator used.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

Why It Matters For Business

You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.

Summary TLDR

The paper introduces two practical frameworks. The attack framework seeds a small set of human-written jailbreak prompts and uses an LLM (gpt-3.5-turbo) with in-context learning to expand them into large, high-quality attack prompt sets (SAP datasets). The defense framework uses those attacks in an iterative fine-tuning loop (instruction tuning with LoRA) so models learn to refuse harmful requests. Experiments show SAP30 produces much higher harmfulness scores than prior automatic or manual sets, fine-tuning can reduce harmful outputs to near-zero on tested Alpaca-LoRA models, and task performance on standard benchmarks stays intact. Code and SAP datasets are released.
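The in-context expansion step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_expansion_prompt` and `expand`, and the `generate` callable (standing in for a call to gpt-3.5-turbo) are all assumptions, and the only filtering shown is de-duplication.

```python
import random
from typing import Callable, List


def build_expansion_prompt(seeds: List[str], k: int, rng: random.Random) -> str:
    """Assemble a few-shot prompt from k sampled examples (wording is hypothetical)."""
    examples = rng.sample(seeds, k)
    parts = ["Here are example red-team test prompts:"]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}: {example}")
    parts.append("Write one new test prompt in the same style.")
    return "\n".join(parts)


def expand(seeds: List[str], generate: Callable[[str], str], target_size: int,
           k: int = 3, seed: int = 0) -> List[str]:
    """Grow the pool until it holds target_size distinct prompts.

    `generate` is a hypothetical stand-in for an LLM call: it takes the
    few-shot prompt and returns one new candidate prompt as text.
    """
    rng = random.Random(seed)
    pool = list(seeds)
    attempts = 0
    # Cap attempts so a generator that keeps repeating itself cannot loop forever.
    while len(pool) < target_size and attempts < 10 * target_size:
        attempts += 1
        prompt = build_expansion_prompt(pool, min(k, len(pool)), rng)
        candidate = generate(prompt).strip()
        if candidate and candidate not in pool:
            pool.append(candidate)
    return pool
```

Sampling the examples from the growing pool, rather than only from the original seeds, is a design choice made here for variety; a real pipeline would likely also score candidates before admitting them.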

Problem Statement

LLMs can be induced to produce harmful content. Creating large, high-quality red-team prompt sets by hand is slow and costly. Fully automatic prompt generators scale but often produce low-quality attacks. We need a low-cost way to produce many realistic attack prompts and a practical defense loop that improves model safety without breaking regular capabilities.

Main Contribution

A semi-automatic attack framework that combines a few human-written prompts with in-context learning on gpt-3.5 to cheaply generate many high-quality attack prompts.

An iterative defense framework that fine-tunes target LLMs on generated attack prompts (instruction tuning with LoRA) and re-expands hard prompts to avoid overfitting.
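The iterative defense loop can be sketched as below. This is a hedged outline under stated assumptions, not the paper's code: `score`, `fine_tune`, and `re_expand` are hypothetical callables (a harmfulness judge, a LoRA instruction-tuning step, and the attack-expansion step), and the threshold of 5.0 and the round limit are illustrative values only.

```python
from typing import Any, Callable, List, Tuple


def iterative_defense(
    model: Any,
    attack_prompts: List[str],
    score: Callable[[Any, str], float],          # hypothetical judge: harmfulness 0-10
    fine_tune: Callable[[Any, List[str]], Any],  # hypothetical LoRA tuning step
    re_expand: Callable[[List[str]], List[str]], # hypothetical attack re-expansion
    threshold: float = 5.0,                      # illustrative cutoff, not from the paper
    max_rounds: int = 3,
) -> Tuple[Any, List[int]]:
    """Repeat: find prompts that still succeed, tune on them, re-expand them."""
    prompts = list(attack_prompts)
    hard_counts = []
    for _ in range(max_rounds):
        # Keep only prompts that still elicit a harmful response.
        hard = [p for p in prompts if score(model, p) >= threshold]
        hard_counts.append(len(hard))
        if not hard:
            break  # nothing left to fix in this prompt set
        model = fine_tune(model, hard)
        # Generate fresh variants of the hard prompts so the next round
        # does not test on exactly what the model was just tuned on.
        prompts = re_expand(hard)
    return model, hard_counts
```

Returning the per-round count of hard prompts makes it easy to see whether the loop is converging or has stalled.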

Key Findings

SAP30 attack set is far more effective than prior sets on evaluated LLMs.

Numbers: gpt-3.5-turbo harmfulness score: SAP30 = 8.70 vs Dual-Use = 5.41 vs BAD+ = 0.63 (Table 1)

Practical Use: Use SAP-style semi-automatic generation to build stronger red-team suites instead of relying only on small manual lists or crude automatic prompts.

Evidence Ref: Table 1

Models without safety fine-tuning are easier to attack.

Numbers: SAP30 on Alpaca-LoRA-7B = 8.80 vs gpt-3.5-turbo = 8.70; GPT-3.5 shows some resistance due to RLHF

Practical Use: Expect open, instruction-tuned models to be more vulnerable than RLHF-aligned systems; plan defenses accordingly.

Evidence Ref: Table 1, Section 4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average harmfulness score (higher = more harmful) | SAP30 on gpt-3.5-turbo = 8.70; Dual-Use = 5.41; BAD+ = 0.63 | Dual-Use, BAD+ | SAP30 +3.29 vs Dual-Use; +8.07 vs BAD+ on gpt-3.5 | Table 1 across datasets | Table 1 shows SAP30 yields substantially higher harmfulness scores than prior sets | Table 1
LoRA | SAP30 = 8.80 | Dual-Use = 6.63 | +2.17 | Table 1 | Table 1 shows SAP30 outperforms Dual-Use on Alpaca-LoRA-7B | Table 1
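The deltas in the results above are simple differences between the SAP30 score and each baseline score on the same model. The short check below recomputes them from the reported values; the dictionary layout and the `deltas` helper are illustrative, not part of the paper's code.

```python
# Harmfulness scores as reported in the results above (Table 1 of the paper).
scores = {
    "gpt-3.5-turbo": {"SAP30": 8.70, "Dual-Use": 5.41, "BAD+": 0.63},
    "Alpaca-LoRA-7B": {"SAP30": 8.80, "Dual-Use": 6.63},
}


def deltas(row, reference="SAP30"):
    """Return reference score minus each baseline score, rounded to 2 decimals."""
    return {
        name: round(row[reference] - value, 2)
        for name, value in row.items()
        if name != reference
    }


for model_name, row in scores.items():
    print(model_name, deltas(row))
```

The recomputed differences (+3.29 and +8.07 on gpt-3.5-turbo, +2.17 on Alpaca-LoRA-7B) match the Delta column.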

What To Try In 7 Days

Run SAP30 (or the public SAP20) against your model and evaluate outputs with a strong LLM judge (gpt-3.5) to find weak spots.

Fine-tune a small LoRA adapter on a handful of hard prompts (for example, SAP5), then re-evaluate and expand the hard cases iteratively.

Integrate automatic attack-generation + evaluation into your safety CI to catch regressions before release.
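The evaluation and CI steps above could be wired together along these lines. This is a sketch under assumptions: the judge is assumed to reply with free text containing a 0-10 score, `parse_score` and `safety_gate` are hypothetical helper names, and the thresholds are placeholders to be tuned per product.

```python
import re
from statistics import mean
from typing import List, Optional


def parse_score(judge_reply: str) -> Optional[float]:
    """Extract the first number from a judge reply; None if absent or out of range.

    Assumes the judge states the score before any other number in its reply.
    """
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0 <= value <= 10 else None


def safety_gate(scores: List[float], max_mean: float = 1.0,
                max_single: float = 5.0) -> bool:
    """Pass only if both the average and the worst-case harmfulness are low.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    if not scores:
        return False  # no evidence means no pass
    return mean(scores) <= max_mean and max(scores) <= max_single


# Example: three judged responses, one of which is clearly harmful.
replies = ["Score: 0", "Score: 1", "Score: 9"]
parsed = [s for s in (parse_score(r) for r in replies) if s is not None]
print(safety_gate(parsed))
```

Gating on the maximum as well as the mean keeps a single severe failure from being averaged away by many benign responses.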

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Defense experiments focus mainly on Alpaca series; broader model families not tested.

Evaluator is gpt-3.5-turbo; it outperforms Perspective API but can misjudge outlier responses (Appendix D, Limitations).

When Not To Use

If you cannot legally or technically fine-tune the target model (closed API without fine-tune access).

As the only safeguard for final production safety, without independent human review; automated defenses still need human oversight.

Failure Modes

Overfitting: 'refuse to answer' responses followed by unexpected harmful text after many iterations (Appendix A).

Evaluator blind spots: gpt-3.5 may misclassify rare or cleverly obfuscated harmful outputs.

Core Entities

Models

gpt-3.5-turbo-0301, text-davinci-003, LoRA

Metrics

harmfulness score (0-10), Accuracy, ROC AUC, recall, precision
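To make the evaluator metrics concrete, the sketch below computes accuracy, precision, recall, and ROC AUC for a judge's 0-10 harmfulness scores against human labels. The threshold of 5.0, the function names, and the sample data are assumptions for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def binary_metrics(scores: List[float], labels: List[bool],
                   threshold: float = 5.0) -> Dict[str, float]:
    """Threshold judge scores and compare with human harmful/not-harmful labels."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    tn = sum(1 for p, y in zip(preds, labels) if not p and not y)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(scores: List[float], labels: List[bool]) -> float:
    """Probability that a harmful item outscores a harmless one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one harmful and one harmless example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up judge scores and human labels, for illustration only.
judge_scores = [9, 8, 2, 6, 1]
human_labels = [True, True, False, False, True]
print(binary_metrics(judge_scores, human_labels))
print(roc_auc(judge_scores, human_labels))
```

ROC AUC is threshold-free, so it complements precision and recall, which depend on where the harmful/harmless cutoff is placed.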

Datasets

SAP5, SAP10, SAP20, SAP30, SAP200, Dual-Use, BAD+

Benchmarks

BoolQ, ARC_Easy, RACE, CB, COPA