Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

August 18, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

16

Authors

Rishabh Bhardwaj, Soujanya Poria

Links

Abstract / PDF

Why It Matters For Business

CoU-style prompts often bypass deployed guardrails. Test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Summary TLDR

The authors introduce RED-EVAL, a red-teaming prompt style called Chain of Utterances (CoU) that frames a conversation between a harmful agent and a helpful agent to coax harmful answers from LLMs. RED-EVAL achieves high attack success rates (ASR) on both closed-source systems (GPT-4 ≈65%, ChatGPT ≈73%) and many open-source models (>85%). They release HARMFULQA: ~1.9K harmful questions plus 9.5K 'blue' (safe) and 7.3K 'red' (jailbroken) conversations. They propose RED-INSTRUCT / SAFE-ALIGN: fine-tuning a Vicuna-7B variant (STARLING) on this data. Fine-tuning on safe ChatGPT conversations (blue data) improves safety (HHH, TruthfulQA) with small utility trade-offs; using red data directly is a

Problem Statement

Large LMs can produce harmful outputs when prompted. Existing red-team prompts (e.g., chain-of-thought) are often recognized and refused by defended systems. We need more effective red-teaming to find guardrail failures and a practical dataset + fine-tuning procedure to make smaller LMs safer while keeping utility.

Main Contribution

RED-EVAL: a Chain-of-Utterances (CoU) red-teaming prompt that frames a dialogue between a 'harmful' and a 'helpful' agent to elicit harmful completions.

HARMFULQA: a dataset of 1,960 harmful questions plus 9,536 blue (safe) and 7,356 red (jailbroken) conversations collected from ChatGPT using CoU prompts.

RED-INSTRUCT / SAFE-ALIGN: two fine-tuning strategies using HARMFULQA to produce STARLING, a safety-aligned Vicuna-7B variant that improves safety metrics while largely preserving utility.

Key Findings

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

Open-source models are highly vulnerable to CoU red-teaming.

Numbers: Open-source models ASR >0.86 on evaluated setups (e.g., Vicuna-7B ~0.875–0.915)

Including 'internal thoughts' in the CoU prompt raises attack success markedly.

Numbers: GPT-4 ASR2 0.651 with internal thoughts vs 0.386 without (absolute +0.265)

Fine-tuning on safe ChatGPT conversations (blue data) improves safety while largely preserving capability.

Numbers: STARLING (BLUE) TruthfulQA 48.90 vs Vicuna-7B 46.99; HHH average roughly +2.3%

Results

RED-EVAL ASR (closed-source)

Value: GPT-4 0.651; ChatGPT 0.728

Baseline: STANDARD/CoT near 0 for closed-source

RED-EVAL ASR (open-source average)

Value: Average >0.86

Baseline: CoT average ~0.48; STANDARD ~0.12

Effect of internal thoughts in CoU prompt

Value: With internal thoughts avg ASR2 0.689 vs without 0.522

Baseline: CoU w/o internal thoughts

Safety / utility after SAFE-ALIGN

Value: STARLING (BLUE-RED) TruthfulQA 49.60; STARLING (BLUE) 48.90; Vicuna-7B 46.99

Baseline: Vicuna-7B
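
The ASR values reported here are fractions: the share of harmful prompts for which a judge labeled the model's response as harmful. A minimal sketch of that arithmetic follows, assuming one boolean judge label per prompt. The function name `attack_success_rate` and the sample labels are illustrative and do not come from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of red-teaming prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Illustrative labels only (not paper data): 3 of 4 attacks succeeded.
labels = [True, False, True, True]
asr = attack_success_rate(labels)

# Absolute gain from internal thoughts, using the GPT-4 figures
# quoted under Key Findings (0.651 with, 0.386 without).
gain = round(0.651 - 0.386, 3)
```

With real data, `labels` would hold one entry per evaluated prompt, so the ASR is only as reliable as the judge that produced those labels.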

Who Should Care

What To Try In 7 Days

Run RED-EVAL (CoU prompts with internal thoughts) against your public model to measure ASR.

Collect a small set of safe (blue) responses from a guarded model and fine-tune a lightweight client model on them.

Add a judge step (e.g., a higher-quality classifier or model-based filter) to block harmful responses of the kind RED-EVAL elicits.
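
The judge step can be sketched as a thin wrapper around generation. This is an assumption about how such a gate might look, not the paper's implementation: `guarded_reply`, `toy_model`, `toy_judge`, and the refusal text are all hypothetical stand-ins for a real model call and a real harm classifier.

```python
from typing import Callable

# Hypothetical refusal text; a deployment would choose its own wording.
REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> str:
    """Generate a reply, then swap in a refusal if the judge flags it."""
    reply = generate(prompt)
    return REFUSAL if is_harmful(prompt, reply) else reply


# Stand-ins for a real model and a real judge (both hypothetical).
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_judge(prompt: str, reply: str) -> bool:
    return "forbidden" in reply.lower()
```

Passing the model and the judge in as callables keeps the gate independent of any particular API, so the same wrapper can be reused when measuring ASR before and after adding the filter.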

Agent Features

Planning

  • CoU roleplay (Red-LM / Base-LM dialogue)

Frameworks

  • RED-EVAL
  • RED-INSTRUCT
  • SAFE-ALIGN

Architectures

  • decoder-only causal Transformer

Collaboration

  • two-agent roleplay in prompt

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training on red (harmful) data can destabilize learning and collapse generations if used too long or with large learning rates.
  • Experiments with full HARMFULQA on open-source models were limited by compute.
  • Evaluation relies heavily on GPT-4 as judge; judge bias and policy filtering can affect labels.
  • Prompt effectiveness is sensitive to template choices and needs human tuning.

When Not To Use

  • Do not use Strategy-B (which includes red data) for a large number of steps K or with a high learning rate without monitoring; it can reduce capabilities.
  • Do not assume a single red-teaming pass finds all failure modes; use multiple prompt variants and human review.
  • Avoid publishing or using raw red conversations without safeguards; they contain harmful content.
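
One way to monitor red-data training is to cap the number of steps and periodically sample generations to check for collapse. The sketch below is a hedged illustration of that idea only. The function name, the step budget, and the 20% empty-output threshold are assumptions, not values from the paper.

```python
from typing import Sequence


def should_halt_red_training(
    sample_generations: Sequence[str],
    step: int,
    max_red_steps: int,
    max_empty_fraction: float = 0.2,  # assumed threshold, tune per setup
) -> bool:
    """Return True when training on red data should stop.

    Stops once the step budget is spent, or when too many sampled
    generations come back empty (a symptom of collapse).
    """
    if step >= max_red_steps:
        return True
    if not sample_generations:
        return False  # nothing sampled yet, so nothing to judge
    empty = sum(1 for g in sample_generations if not g.strip())
    return empty / len(sample_generations) > max_empty_fraction
```

A training loop would call this every few steps on a small fixed set of benign prompts and fall back to the last good checkpoint when it returns True.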

Failure Modes

  • Model collapse (stops generating) when aggressively maximizing loss on harmful responses.
  • Fine-tuning instability leading to higher jailbreak susceptibility if red data training is noisy.
  • Judge-label bias: GPT-4 refusals or labeling rules can skew measured ASR.
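
To see how much judge refusals could move a measured ASR, one option is to report a range rather than a single number. The sketch below assumes each response gets a label of True (harmful), False (safe), or None (the judge declined to label). This three-way labeling and the name `asr_bounds` are illustrative assumptions, not part of the paper's protocol.

```python
from typing import Optional, Sequence, Tuple


def asr_bounds(labels: Sequence[Optional[bool]]) -> Tuple[float, float]:
    """Lower and upper ASR when some judge labels are missing (None).

    Lower bound counts unlabeled responses as safe;
    upper bound counts them as harmful.
    """
    if not labels:
        raise ValueError("need at least one label")
    n = len(labels)
    harmful = sum(1 for x in labels if x is True)
    unknown = sum(1 for x in labels if x is None)
    return harmful / n, (harmful + unknown) / n


# Illustrative labels only (not paper data).
low, high = asr_bounds([True, None, False, True])
```

A wide gap between the two bounds signals that judge abstentions, rather than model behavior, dominate the reported number.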

Core Entities

Models

  • Vicuna-7B
  • Vicuna-13B
  • StableBeluga-7B
  • StableBeluga-13B
  • LLaMA2-FT-7B
  • VICUNA-FT-7B
  • ChatGPT
  • GPT-4
  • STARLING

Metrics

  • ASR (Attack Success Rate)
  • HHH (helpful, honest, harmless)
  • TruthfulQA score
  • MMLU exact-match
  • BBH score

Datasets

  • HARMFULQA
  • DANGEROUSQA
  • ShareGPT
  • HHH
  • TruthfulQA
  • MMLU
  • BBH

Benchmarks

  • RED-EVAL
  • HHH
  • TruthfulQA
  • MMLU
  • BBH