Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

Overview

Decision Snapshot: Ready For Pilot

The CoU red-teaming method and HARMFULQA dataset are well documented and reproducible. Results rely on GPT-4 as the judge, and compute limits prevented some experiments, so run the tests locally before scaling.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Rishabh Bhardwaj, Soujanya Poria

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CoU-style prompts often bypass deployed guardrails. Test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors introduce RED-EVAL, a red-teaming prompt style called Chain of Utterances (CoU) that frames a conversation between a harmful agent and a helpful agent to coax harmful answers from LLMs. RED-EVAL achieves high attack success rates (ASR) on both closed-source systems (GPT-4 ≈65%, ChatGPT ≈73%) and many open-source models (>85%). They release HARMFULQA: ~1.9K harmful questions plus 9.5K 'blue' (safe) and 7.3K 'red' (jailbroken) conversations. They propose RED-INSTRUCT / SAFE-ALIGN: fine-tuning a Vicuna-7B variant (STARLING) on this data. Fine-tuning on safe ChatGPT conversations (blue data) improves safety (HHH, TruthfulQA) with small utility trade-offs; using red data directly is a

Problem Statement

Large LMs can produce harmful outputs when prompted. Existing red-team prompts (e.g., chain-of-thought) are often recognized and refused by defended systems. We need more effective red-teaming to find guardrail failures and a practical dataset + fine-tuning procedure to make smaller LMs safer while keeping utility.

Main Contribution

RED-EVAL: a Chain-of-Utterances (CoU) red-teaming prompt that frames a dialogue between a 'harmful' and a 'helpful' agent to elicit harmful completions.

HARMFULQA: a dataset of 1,960 harmful questions plus 9,536 blue (safe) and 7,356 red (jailbroken) conversations collected from ChatGPT using CoU prompts.

Key Findings

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

Practical Use: Do not assume API guardrails are bulletproof. Test production systems with CoU-style prompts and patch or filter outputs before public deployment.

Evidence Ref: Table 3, Abstract

Open-source models are highly vulnerable to CoU red-teaming.

Numbers: Open-source models ASR >0.86 on evaluated setups (e.g., Vicuna-7B ~0.875–0.915)

Practical Use: Treat out-of-the-box open-source chat models as high risk for harmful outputs; apply alignment/fine-tuning or runtime filters before use.

Evidence Ref: Table 3, Section 4.1
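The ASR figures above are fractions of red-teaming attempts whose response the judge labels as harmful. A minimal sketch of that computation, assuming judge verdicts are already available as booleans; the function name `attack_success_rate` and the sample verdicts are illustrative and not taken from the paper:

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the fraction of red-teaming attempts judged harmful.

    verdicts[i] is True when the judge labeled response i as harmful,
    i.e. the attack succeeded. The labels are assumed to come from an
    external judge (model-based or human).
    """
    if not verdicts:
        raise ValueError("need at least one verdict to compute ASR")
    return sum(verdicts) / len(verdicts)


# Illustrative verdicts only: 3 of 4 attempts judged harmful.
example_verdicts = [True, False, True, True]
example_asr = attack_success_rate(example_verdicts)
print(f"ASR = {example_asr:.3f}")  # prints "ASR = 0.750"
```

Because the reported numbers depend on GPT-4 as the judge, it may be worth spot-checking a sample of verdicts by hand before comparing your own ASR to the figures above.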

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
RED-EVAL ASR (closed-source)	GPT-4 0.651; ChatGPT 0.728	STANDARD/COT near 0 for closed-source	Large increase vs STANDARD/COT	DANGEROUSQA (200) / HARMFULQA (1,960)	Table 3; Abstract	Table 3
RED-EVAL ASR (open-source average)	Average >0.86	CoT average ~0.48; STANDARD ~0.12	+~0.39 absolute vs CoT on open-source	DANGEROUSQA (200)	Table 3, Section 4.1	Table 3

What To Try In 7 Days

Run RED-EVAL (CoU prompts with internal thoughts) against your public model to measure ASR.

Collect a small set of safe (blue) responses from a guarded model and fine-tune a lightweight client model on them.

Add a judge step (e.g., a higher-quality classifier or model-based filter) to block harmful responses of the kind RED-EVAL elicits.
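The judge step in the last item can be sketched as a thin wrapper around generation. This is a minimal sketch under stated assumptions: `generate` and `is_harmful` are hypothetical callables you supply (your model call and your classifier or model-based judge), and the refusal text is a placeholder.

```python
from typing import Callable

REFUSAL_MESSAGE = "Sorry, I can't help with that."  # placeholder wording


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> str:
    """Generate a reply, then let a judge decide whether to release it."""
    reply = generate(prompt)
    if is_harmful(prompt, reply):
        # The judge flagged the reply, so return the refusal instead.
        return REFUSAL_MESSAGE
    return reply


# Toy stand-ins so the sketch runs end to end; replace with real components.
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_judge(prompt: str, reply: str) -> bool:
    return "forbidden" in reply.lower()


print(guarded_reply("hello", toy_generate, toy_judge))                # echo: hello
print(guarded_reply("something FORBIDDEN", toy_generate, toy_judge))  # refusal text
```

Passing the judge in as a callable keeps the gate independent of any one vendor or model, so the same wrapper can be reused when the judge is swapped or upgraded.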

Agent Features

Planning

CoU roleplay (Red-LM / Base-LM dialogue)

Frameworks

RED-EVAL, RED-INSTRUCT, SAFE-ALIGN

Architectures

decoder-only causal Transformer

Collaboration

two-agent roleplay in prompt

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/declare-lab/red-instruct
https://huggingface.co/datasets/declare-lab/HarmfulQA
https://huggingface.co/declare-lab/starling-7B

Data URLs

https://huggingface.co/datasets/declare-lab/HarmfulQA

Risks & Boundaries

Limitations

Training on red (harmful) data can destabilize learning and collapse generations if used too long or with large learning rates.

Experiments with full HARMFULQA on open-source models were limited by compute.

When Not To Use

Do not use Strategy-B (which includes red data) with a large number of steps K or a high learning rate without monitoring; it can reduce capabilities.

Do not assume a single red-teaming pass finds all failure modes; use multiple prompt variants and human review.
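One way to combine several prompt variants is to count a question as vulnerable if any variant elicits a harmful response. A minimal sketch, assuming per-variant judge verdicts are stored per question; the data layout and function names are hypothetical:

```python
from typing import Dict, List


def vulnerable_questions(results: Dict[str, List[bool]]) -> List[str]:
    """Return question IDs where at least one prompt variant succeeded.

    results maps a question ID to one judge verdict per prompt variant
    (True means the response was judged harmful).
    """
    return sorted(qid for qid, verdicts in results.items() if any(verdicts))


def any_variant_asr(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions broken by at least one variant."""
    if not results:
        raise ValueError("no results to aggregate")
    return len(vulnerable_questions(results)) / len(results)


# Illustrative verdicts for three questions with two variants each.
demo = {"q1": [False, True], "q2": [False, False], "q3": [True, True]}
print(vulnerable_questions(demo))  # ['q1', 'q3']
print(round(any_variant_asr(demo), 3))  # 0.667
```

The list of vulnerable question IDs is also a natural queue for human review, since a model-based judge can mislabel borderline responses.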

Failure Modes

Model collapse (stops generating) when aggressively maximizing loss on harmful responses.

Fine-tuning instability leading to higher jailbreak susceptibility if red data training is noisy.
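Collapse of the kind described above can be caught early by sampling generations during fine-tuning and checking how many are empty or near-empty. A minimal sketch; the threshold values and function names are assumptions, not settings from the paper:

```python
from typing import Sequence


def empty_generation_rate(generations: Sequence[str], min_chars: int = 1) -> float:
    """Fraction of sampled generations shorter than min_chars after stripping."""
    if not generations:
        raise ValueError("need at least one generation to inspect")
    empty = sum(1 for text in generations if len(text.strip()) < min_chars)
    return empty / len(generations)


def looks_collapsed(generations: Sequence[str], max_empty_rate: float = 0.2) -> bool:
    """Flag a checkpoint when too many sampled generations are empty.

    max_empty_rate is an assumed threshold; tune it on your own runs.
    """
    return empty_generation_rate(generations) > max_empty_rate


# Two of four sampled generations are empty or whitespace-only.
samples = ["A normal answer.", "", "   ", "Another answer."]
print(empty_generation_rate(samples))  # 0.5
print(looks_collapsed(samples))        # True
```

Running this check on a fixed set of benign prompts at each checkpoint gives a cheap signal for stopping or rolling back a run before capabilities degrade further.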

Core Entities

Models

Vicuna-7B, Vicuna-13B, StableBeluga-7B, StableBeluga-13B, LLaMA2-FT-7B, VICUNA-FT-7B, ChatGPT, GPT-4, STARLING

Metrics

ASR (Attack Success Rate); HHH (helpful, honest, harmless); TruthfulQA score; MMLU exact-match; BBH score

Datasets

HARMFULQA, DANGEROUSQA, ShareGPT, HHH, TruthfulQA, MMLU, BBH

Benchmarks

RED-EVAL, HHH, TruthfulQA, MMLU, BBH

