Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

August 18, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The CoU red-teaming method and HARMFULQA dataset are well documented and reproducible. Results rely on GPT-4 as the judge, and compute limits prevented some experiments, so run the tests locally before scaling.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Rishabh Bhardwaj, Soujanya Poria

Why It Matters For Business

CoU-style prompts often bypass deployed guardrails. Test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Summary TLDR

The authors introduce RED-EVAL, a red-teaming benchmark built on Chain of Utterances (CoU) prompts, which frame a conversation between a harmful agent and a helpful agent to coax harmful answers from LLMs. RED-EVAL achieves high attack success rates (ASR) on both closed-source systems (GPT-4 ≈65%, ChatGPT ≈73%) and many open-source models (>86%). They release HARMFULQA: ~1.9K harmful questions plus ~9.5K 'blue' (safe) and ~7.3K 'red' (jailbroken) conversations. They propose RED-INSTRUCT / SAFE-ALIGN: fine-tuning Vicuna-7B on this data to obtain STARLING. Fine-tuning on safe ChatGPT conversations (blue data) improves safety (HHH, TruthfulQA) with small utility trade-offs; using red data directly is a

Problem Statement

Large language models can produce harmful outputs when prompted. Existing red-team prompts (e.g., chain-of-thought) are often recognized and refused by defended systems. We need more effective red-teaming to find guardrail failures, plus a practical dataset and fine-tuning procedure that make smaller LMs safer while keeping utility.

Main Contribution

RED-EVAL: a Chain-of-Utterances (CoU) red-teaming prompt that frames a dialogue between a 'harmful' and a 'helpful' agent to elicit harmful completions.

HARMFULQA: a dataset of 1,960 harmful questions plus 9,536 blue (safe) and 7,356 red (jailbroken) conversations collected from ChatGPT using CoU prompts.

Key Findings

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

Practical Use: Do not assume API guardrails are bulletproof. Test production systems with CoU-style prompts and patch or filter outputs before public deployment.

Evidence Ref: Table 3, Abstract

Open-source models are highly vulnerable to CoU red-teaming.

Numbers: Open-source models ASR >0.86 on evaluated setups (e.g., Vicuna-7B ~0.875 to 0.915)

Practical Use: Treat out-of-the-box open-source chat models as high risk for harmful outputs; apply alignment/fine-tuning or runtime filters before use.

Evidence Ref: Table 3, Section 4.1
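The ASR figures above are fractions of red-teaming attempts that the judge labeled harmful. A minimal sketch of that calculation, assuming each attempt has already been reduced to a boolean verdict (the function name and the sample verdicts are illustrative, not taken from the paper's code):

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Return the fraction of attempts whose response was judged harmful.

    Each verdict is True when the judge labeled the response harmful.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one judged attempt")
    return sum(verdicts) / len(verdicts)


# Hypothetical judged run over four harmful questions.
judged = [True, True, False, True]
print(f"ASR = {attack_success_rate(judged):.3f}")  # ASR = 0.750
```

An ASR of 0.651, for instance, would mean roughly 65 of every 100 attempts were judged harmful.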

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| RED-EVAL ASR (closed-source) | GPT-4 0.651; ChatGPT 0.728 | STANDARD/CoT near 0 for closed-source | Large increase vs STANDARD/CoT | DANGEROUSQA (200) / HARMFULQA (1,960) | Table 3; Abstract | Table 3 |
| RED-EVAL ASR (open-source average) | Average >0.86 | CoT average ~0.48; STANDARD ~0.12 | ~+0.39 absolute vs CoT on open-source | DANGEROUSQA (200) | Table 3, Section 4.1 | Table 3 |

What To Try In 7 Days

Run RED-EVAL (CoU prompts with internal thoughts) against your public model to measure ASR.

Collect a small set of safe (blue) responses from a guarded model and fine-tune a lightweight client model on them.

Add a judge step (e.g., a higher-quality classifier or model-based filter) to block responses that the judge flags as harmful during RED-EVAL runs.
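The judge step can be sketched as a thin wrapper around generation. This is a minimal illustration under stated assumptions: `generate` and `is_harmful` are hypothetical callables standing in for your model and your classifier or model-based judge, and the refusal text is a placeholder.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # placeholder refusal text


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Generate a response, then replace it with a refusal if the judge flags it."""
    response = generate(prompt)
    if is_harmful(prompt, response):
        return refusal
    return response


# Toy stand-ins so the sketch runs without any model or API.
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_judge(prompt: str, response: str) -> bool:
    return "forbidden" in response.lower()


print(guarded_reply("hello", toy_generate, toy_judge))
print(guarded_reply("something FORBIDDEN", toy_generate, toy_judge))
```

Passing the judge in as a callable lets the same wrapper serve both offline red-team runs and a production filter.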

Agent Features

Planning: CoU roleplay (Red-LM / Base-LM dialogue)
Frameworks: RED-EVAL, RED-INSTRUCT, SAFE-ALIGN
Architectures: decoder-only causal Transformer
Collaboration: two-agent roleplay in prompt

Risks & Boundaries

Limitations

Training on red (harmful) data can destabilize learning and collapse generations if used too long or with large learning rates.

Experiments with full HARMFULQA on open-source models were limited by compute.

When Not To Use

Do not use Strategy-B (which includes red data) with a large number of steps K or a high learning rate without monitoring; it can reduce capabilities.

Do not assume a single red-teaming pass finds all failure modes; use multiple prompt variants and human review.
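When several prompt variants are run over the same question set, it helps to report both per-variant ASR and the fraction of questions broken by at least one variant. A minimal sketch, assuming verdict lists are aligned by question; the variant names and data are hypothetical:

```python
from typing import Dict, List


def per_variant_asr(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """ASR of each prompt variant on its own."""
    return {name: sum(v) / len(v) for name, v in results.items()}


def any_variant_asr(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions jailbroken by at least one variant."""
    lengths = {len(v) for v in results.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all variants need the same non-empty question set")
    hits = [any(row) for row in zip(*results.values())]
    return sum(hits) / len(hits)


# Hypothetical verdicts: True means the judge labeled the response harmful.
results = {
    "variant_a": [True, False, False, False],
    "variant_b": [False, True, False, False],
}
print(per_variant_asr(results))  # {'variant_a': 0.25, 'variant_b': 0.25}
print(any_variant_asr(results))  # 0.5
```

In this toy data each variant looks weak alone, yet together they break half the questions, which is the gap a single pass would miss.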

Failure Modes

Model collapse (stops generating) when aggressively maximizing loss on harmful responses.

Fine-tuning instability leading to higher jailbreak susceptibility if red data training is noisy.
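One way to watch for these failure modes during fine-tuning is a periodic check on sampled generations and a utility score. The sketch below is an assumption-laden illustration, not the paper's procedure: the thresholds, the function name, and the choice of "empty output" as the collapse signal are all hypothetical.

```python
from typing import List


def should_stop(
    generations: List[str],
    baseline_score: float,
    current_score: float,
    max_empty_frac: float = 0.2,  # hypothetical tolerance for empty outputs
    max_drop: float = 0.05,       # hypothetical tolerated utility drop
) -> bool:
    """Return True if training should halt for inspection.

    Halts when too many sampled generations are empty (a collapse signal)
    or when the utility score fell more than max_drop below the baseline.
    """
    if not generations:
        return True  # nothing was generated at all
    empty = sum(1 for g in generations if not g.strip())
    empty_frac = empty / len(generations)
    dropped = (baseline_score - current_score) > max_drop
    return empty_frac > max_empty_frac or dropped


print(should_stop(["a", "b", "", "c"], 0.50, 0.49))   # True: 25% empty
print(should_stop(["a", "b", "c", "d"], 0.50, 0.49))  # False
```

The utility score could be any held-out benchmark number you already track, such as an MMLU or BBH subset.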

Core Entities

Models

Vicuna-7B, Vicuna-13B, StableBeluga-7B, StableBeluga-13B, LLaMA2-FT-7B, VICUNA-FT-7B, ChatGPT, GPT-4, STARLING

Metrics

ASR (Attack Success Rate); HHH (helpful, honest, harmless); TruthfulQA score; MMLU exact-match; BBH score

Datasets

HARMFULQA, DANGEROUSQA, ShareGPT, HHH, TruthfulQA, MMLU, BBH

Benchmarks

RED-EVAL, HHH, TruthfulQA, MMLU, BBH