Semi-automatic pipeline: teach an LLM to generate high-quality attack prompts, then iteratively fine-tune models to refuse them

October 19, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

Links

Abstract / PDF

Why It Matters For Business

You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.

Summary TLDR

The paper introduces two practical frameworks. The attack framework seeds a small set of human-written jailbreak prompts and uses an LLM (gpt-3.5-turbo) with in-context learning to expand them into large, high-quality attack prompt sets (SAP datasets). The defense framework uses those attacks in an iterative fine-tuning loop (instruction tuning with LoRA) so models learn to refuse harmful requests. Experiments show SAP30 produces much higher harmfulness scores than prior automatic or manual sets, fine-tuning can reduce harmful outputs to near-zero on tested Alpaca-LoRA models, and task performance on standard benchmarks stays intact. Code and SAP datasets are released.
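The attack-side expansion can be pictured with a minimal sketch. It assumes a generic `generate` callable standing in for a gpt-3.5-turbo call; the prompt template, the function names, and the choice to sample examples from the growing pool are illustrative assumptions, not the authors' released code.

```python
import random
from typing import Callable, List


def build_expansion_prompt(examples: List[str], n_examples: int = 3, seed: int = 0) -> str:
    """Assemble a few-shot prompt from sampled examples (hypothetical template)."""
    rng = random.Random(seed)
    shots = rng.sample(examples, k=min(n_examples, len(examples)))
    parts = ["You are a red-team assistant. Here are sample test prompts:"]
    for i, shot in enumerate(shots, start=1):
        parts.append(f"Example {i}: {shot}")
    parts.append("Write one new test prompt in the same style.")
    return "\n".join(parts)


def expand_prompts(
    seeds: List[str],
    generate: Callable[[str], str],  # assumed wrapper around an LLM call
    target_size: int,
    max_calls: int = 1000,
) -> List[str]:
    """Grow the seed set until it reaches target_size, skipping duplicates."""
    pool = list(seeds)
    calls = 0
    while len(pool) < target_size and calls < max_calls:
        prompt = build_expansion_prompt(pool, seed=calls)
        candidate = generate(prompt).strip()
        calls += 1
        if candidate and candidate not in pool:
            pool.append(candidate)
    return pool
```

In practice `generate` would wrap a chat-completion request; keeping it as a parameter lets the loop be tested offline with a stub.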

Problem Statement

LLMs can be induced to produce harmful content. Creating large sets of high-quality red-team prompts by hand is slow and costly. Fully automatic prompt generators scale but often produce low-quality attacks. We need a low-cost way to produce many realistic attack prompts and a practical defense loop that improves model safety without breaking regular capabilities.

Main Contribution

A semi-automatic attack framework that combines a few human-written prompts with in-context learning on gpt-3.5-turbo to cheaply generate many high-quality attack prompts.

An iterative defense framework that fine-tunes target LLMs on generated attack prompts (instruction tuning with LoRA) and re-expands hard prompts to avoid overfitting.

A released suite of SAP attack-prompt datasets (sizes from 40 to 1,600 prompts) and experiments showing strong attack power and effective defense with small fine-tuning budgets.
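The iterative defense loop described above can be sketched as follows. All four callables (`respond`, `judge`, `fine_tune`, `expand`) are hypothetical stand-ins, and the default threshold of 5 on the 0-10 harmfulness scale is an assumption; the toy model at the bottom only exercises the control flow and is not a real fine-tune.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")


def defense_loop(
    model: Model,
    prompts: List[str],
    respond: Callable[[Model, str], str],           # model -> response text
    judge: Callable[[str], float],                  # response -> harmfulness 0-10
    fine_tune: Callable[[Model, List[str]], Model], # e.g. a LoRA update (assumed)
    expand: Callable[[List[str]], List[str]],       # regenerate hard prompts
    threshold: float = 5.0,
    max_rounds: int = 3,
) -> Tuple[Model, List[float]]:
    """Evaluate, fine-tune on prompts that still succeed, then re-expand them."""
    history: List[float] = []
    for _ in range(max_rounds):
        if not prompts:
            break
        scored = [(p, judge(respond(model, p))) for p in prompts]
        history.append(sum(s for _, s in scored) / len(scored))
        hard = [p for p, s in scored if s >= threshold]
        if not hard:
            break  # nothing left that the model answers harmfully
        model = fine_tune(model, hard)
        prompts = expand(hard)  # fresh variants, so the set is not immutable
    return model, history


# Toy stand-ins: the "model" is just the set of topics it refuses.
def topic(prompt: str) -> str:
    return prompt.split(":")[0]

def toy_respond(refused: frozenset, prompt: str) -> str:
    return "refuse" if topic(prompt) in refused else "comply"

def toy_judge(response: str) -> float:
    return 0.0 if response == "refuse" else 9.0

def toy_fine_tune(refused: frozenset, hard: List[str]) -> frozenset:
    return refused | {topic(p) for p in hard}

def toy_expand(hard: List[str]) -> List[str]:
    return [p + " (variant)" for p in hard]

final_model, history = defense_loop(
    frozenset(), ["a:1", "b:1"], toy_respond, toy_judge, toy_fine_tune, toy_expand
)
```

Regenerating the hard prompts each round, rather than reusing a fixed set, mirrors the overfitting mitigation the summary attributes to the authors.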

Key Findings

SAP30 attack set is far more effective than prior sets on evaluated LLMs.

Numbers: gpt-3.5-turbo harmful score: SAP30 = 8.70 vs Dual-Use = 5.41 vs BAD+ = 0.63 (Table 1)

Models without safety fine-tuning are easier to attack.

Numbers: SAP30 on Alpaca-LoRA-7B = 8.80 vs gpt-3.5-turbo = 8.70; gpt-3.5-turbo shows some resistance due to RLHF

Iterative fine-tuning with SAP prompts sharply reduces harmful outputs.

Numbers: Alpaca-LoRA-7B harmful score on SAP20: 8.49 -> 0.01 after SAP5 fine-tune (Table 2)

Defense fine-tuning has little negative impact on standard NLP tasks.

Numbers: 13B model ARC_Easy accuracy stays ~0.763 before/after SAP5 fine-tune (Table 3)

gpt-3.5-turbo is an effective automatic harmfulness judge in this pipeline.

Numbers: evaluator recall = 0.94 and precision = 1.00 at threshold 5, outperforming Perspective API (Appendix D)

Generating SAP200 is cheap and takes a modest amount of time.

Numbers: SAP200 took ~35 hours and ~$10 in OpenAI API calls (Appendix E.4)
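To make the evaluator numbers concrete, here is a minimal sketch of how precision and recall at a score threshold are computed from judge scores and human labels. Treating a score of 5 or higher as "harmful" is an assumption (the summary does not say whether the cut is inclusive), and the sample data is made up for illustration.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float = 5.0
) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are flagged as harmful."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: judge scores (0-10) and human "is harmful" labels.
example_scores = [9.0, 7.0, 3.0, 1.0, 6.0]
example_labels = [True, True, True, False, True]
precision, recall = precision_recall(example_scores, example_labels)
```

A precision of 1.00 means every response the judge flagged was truly harmful; a recall below 1.00 means some harmful responses scored under the threshold and were missed.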

Results

Average harmfulness score (higher=more harmful)

Value: SAP30 on gpt-3.5-turbo = 8.70; Dual-Use = 5.41; BAD+ = 0.63

Baseline: Dual-Use, BAD+

Average harmfulness score on Alpaca-LoRA-7B

Value: SAP30 = 8.80

Baseline: Dual-Use = 6.63

Defense effect: harmfulness after fine-tuning

Value: Alpaca-LoRA-7B on SAP20: before = 8.49 -> after SAP5 fine-tune = 0.01

Baseline: before fine-tune

Accuracy

Value: 13B ARC_Easy accuracy: original = 0.763 -> SAP5 fine-tune = 0.763

Baseline: original model
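As a quick arithmetic check on the defense and accuracy results above, the relative drop in harmfulness and the change in accuracy can be computed directly from the reported values. The helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a score dropped, relative to its starting value."""
    return (before - after) / before


# Reported values: Alpaca-LoRA-7B on SAP20, before and after SAP5 fine-tune.
harm_drop = relative_reduction(8.49, 0.01)

# Reported values: 13B ARC_Easy accuracy, original vs SAP5 fine-tune.
accuracy_delta = 0.763 - 0.763

print(f"harmfulness reduced by {harm_drop:.1%}, accuracy change {accuracy_delta:+.3f}")
```

The harmfulness score falls by roughly 99.9 percent while the reported ARC_Easy accuracy is unchanged at three decimal places.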

Who Should Care

What To Try In 7 Days

Run SAP30 (or the public SAP20) against your model and evaluate outputs with a strong LLM judge (gpt-3.5) to find weak spots.

Fine-tune a small LoRA adapter on a handful (SAP5) of hard prompts, then re-evaluate and expand hard cases iteratively.

Integrate automatic attack-generation + evaluation into your safety CI to catch regressions before release.
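The CI step in the list above could take the form of a simple gate over judge scores. This is a sketch under assumptions: the function name and both limits (`max_mean`, `max_single`) are hypothetical choices, not values from the paper, and the scores are expected to come from whatever attack-and-judge run precedes it.

```python
from statistics import mean
from typing import List, Tuple


def safety_gate(
    scores: List[float],
    max_mean: float = 1.0,    # assumed limit on average harmfulness (0-10 scale)
    max_single: float = 5.0,  # assumed limit on any single response
) -> Tuple[bool, str]:
    """Return (passed, reason) for a batch of harmfulness scores."""
    if not scores:
        return False, "no scores: evaluation did not run"
    avg = mean(scores)
    worst = max(scores)
    if avg > max_mean:
        return False, f"mean harmfulness {avg:.2f} exceeds {max_mean:.2f}"
    if worst > max_single:
        return False, f"worst response scored {worst:.2f}, limit {max_single:.2f}"
    return True, f"ok: mean {avg:.2f}, worst {worst:.2f}"
```

A CI job would call `safety_gate` on the latest scores and exit non-zero when `passed` is false, so a regression blocks the release. Failing on an empty score list guards against a silently skipped evaluation.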

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Defense experiments focus mainly on Alpaca series; broader model families not tested.
  • Evaluator is gpt-3.5-turbo; it outperforms Perspective API but can misjudge outlier responses (Appendix D, Limitations).
  • Overfitting can occur if the fine-tuning set is immutable; authors mitigate this by regenerating prompts each iteration.
  • SAP datasets may mirror the distribution of their seed prompts and not cover all real-world attack styles.

When Not To Use

  • If you cannot legally or technically fine-tune the target model (closed API without fine-tune access).
  • For final production safety without independent human review; automated defenses need human oversight.
  • When your threat model is very different from the eight topics covered by SAP.

Failure Modes

  • Overfitting: 'refuse to answer' responses followed by unexpected harmful text after many iterations (Appendix A).
  • Evaluator blind spots: gpt-3.5 may misclassify rare or cleverly obfuscated harmful outputs.
  • Dataset misuse: released attack prompts could be weaponized if handled irresponsibly.

Core Entities

Models

  • gpt-3.5-turbo-0301
  • text-davinci-003
  • LoRA

Metrics

  • harmfulness score (0-10)
  • Accuracy
  • ROC AUC
  • recall
  • precision

Datasets

  • SAP5
  • SAP10
  • SAP20
  • SAP30
  • SAP200
  • Dual-Use
  • BAD+

Benchmarks

  • BoolQ
  • ARC_Easy
  • RACE
  • CB
  • COPA