Overview
The CoU red-teaming method and the HARMFULQA dataset are well documented and reproducible. Results rely on GPT-4 as the judge, and compute limits prevented some experiments, so run the tests on your own models before scaling.
Citations: 16
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
CoU-style prompts frequently bypass deployed guardrails (ASR of about 65% on GPT-4 and 73% on ChatGPT). Test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs with little loss of utility.
Summary TLDR
The authors introduce RED-EVAL, a red-teaming prompt style called Chain of Utterances (CoU) that frames a conversation between a harmful agent and a helpful agent to coax harmful answers from LLMs. RED-EVAL achieves high attack success rates (ASR) on both closed-source systems (GPT-4 ≈65%, ChatGPT ≈73%) and many open-source models (>85%). They release HARMFULQA: ~1.9K harmful questions plus 9.5K 'blue' (safe) and 7.3K 'red' (jailbroken) conversations. They propose RED-INSTRUCT / SAFE-ALIGN: fine-tuning a Vicuna-7B variant (STARLING) on this data. Fine-tuning on safe ChatGPT conversations (blue data) improves safety (HHH, TruthfulQA) with small utility trade-offs; using red data directly is a
Problem Statement
Large LMs can produce harmful outputs when prompted. Existing red-team prompts (e.g., chain-of-thought) are often recognized and refused by defended systems. We need more effective red-teaming to find guardrail failures and a practical dataset + fine-tuning procedure to make smaller LMs safer while keeping utility.
Main Contribution
RED-EVAL: a Chain-of-Utterances (CoU) red-teaming prompt that frames a dialogue between a 'harmful' and a 'helpful' agent to elicit harmful completions.
HARMFULQA: a dataset of 1,960 harmful questions plus 9,536 blue (safe) and 7,356 red (jailbroken) conversations collected from ChatGPT using CoU prompts.
Key Findings
RED-EVAL jailbreaks widely deployed closed-source APIs frequently.
Open-source models are highly vulnerable to CoU red-teaming.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| RED-EVAL ASR (closed-source) | GPT-4 0.651; ChatGPT 0.728 | STANDARD/COT near 0 for closed-source | Large increase vs STANDARD/COT | DANGEROUSQA (200) / HARMFULQA (1,960) | Table 3; Abstract | Table 3 |
| RED-EVAL ASR (open-source average) | Average >0.86 | CoT average ~0.48; STANDARD ~0.12 | +~0.39 absolute vs CoT on open-source | DANGEROUSQA (200) | Table 3, Section 4.1 | Table 3 |
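
The ASR values and deltas in the table can be reproduced from per-response judge labels. The following is a minimal sketch, assuming ASR is the fraction of responses the judge labels harmful; the function names and the toy labels are illustrative and not taken from the paper's code.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of responses that the judge labeled harmful (assumed ASR definition)."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


def absolute_delta(asr_new: float, asr_baseline: float) -> float:
    """Absolute difference in ASR between two prompt styles."""
    return asr_new - asr_baseline


# Toy judge labels for illustration only (not data from the paper).
cou_labels = [True, True, False, True]
cot_labels = [False, True, False, False]

cou_asr = attack_success_rate(cou_labels)
cot_asr = attack_success_rate(cot_labels)
print(f"CoU ASR={cou_asr:.2f}, CoT ASR={cot_asr:.2f}, "
      f"delta={absolute_delta(cou_asr, cot_asr):+.2f}")
```

Reporting the delta as an absolute difference, as the table does, keeps it comparable across models whose baseline ASR is near zero.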
What To Try In 7 Days
Run RED-EVAL (CoU prompts with internal thoughts) against your public model to measure ASR.
Collect a small set of safe (blue) responses from a guarded model and fine-tune a lightweight client model on them.
Add a judge step (e.g., higher-quality classifier or model-based filter) to block responses flagged as harmful by RED-EVAL.
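
The judge step in the last item could be wired in as a thin wrapper around generation. This is a hedged sketch only: `guarded_reply`, `toy_generate`, `toy_judge`, and the refusal text are hypothetical names introduced here, and the keyword check stands in for a real classifier or model-based judge.

```python
from typing import Callable

# Hypothetical refusal text; replace with your product's wording.
REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    judge_is_harmful: Callable[[str, str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Generate a response, then return it only if the judge does not flag it."""
    response = generate(prompt)
    if judge_is_harmful(prompt, response):
        return refusal
    return response


# Stand-ins for a real model and a real judge (assumptions for the demo).
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_judge(prompt: str, response: str) -> bool:
    return "blocked-topic" in response.lower()


print(guarded_reply("hello", toy_generate, toy_judge))
print(guarded_reply("a BLOCKED-TOPIC question", toy_generate, toy_judge))
```

Passing the model and judge in as callables lets the same wrapper be reused to measure ASR offline and to gate responses in production.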
Risks & Boundaries
Limitations
Training on red (harmful) data can destabilize learning and collapse generations if used too long or with large learning rates.
Experiments with full HARMFULQA on open-source models were limited by compute.
When Not To Use
Do not use Strategy-B (which includes red data) with a large number of steps K or a high learning rate without monitoring, because it can reduce capabilities.
Do not assume a single red-teaming pass finds all failure modes; use multiple prompt variants and human review.
Failure Modes
Model collapse (stops generating) when aggressively maximizing loss on harmful responses.
Fine-tuning instability leading to higher jailbreak susceptibility if red data training is noisy.
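
One way to catch the collapse failure mode early is to track generation length on a fixed set of probe prompts during fine-tuning. The sketch below is an assumption-laden illustration, not the authors' procedure: `has_collapsed`, the baseline mean, and the `min_ratio` threshold are all hypothetical choices.

```python
from statistics import mean
from typing import Sequence


def has_collapsed(
    gen_lengths: Sequence[int],
    baseline_mean: float,
    min_ratio: float = 0.2,
) -> bool:
    """Flag collapse when mean generation length (in tokens) falls below
    min_ratio times the pre-fine-tuning baseline. Empty input counts as collapse."""
    if not gen_lengths:
        return True
    return mean(gen_lengths) < min_ratio * baseline_mean


# Toy checkpoints for illustration: token counts on the same probe prompts.
baseline = 50.0
healthy_checkpoint = [40, 60, 55]
collapsed_checkpoint = [0, 1, 0]

print(has_collapsed(healthy_checkpoint, baseline))
print(has_collapsed(collapsed_checkpoint, baseline))
```

A check like this could run every few training steps, stopping the run or lowering the learning rate when it fires; a utility benchmark would still be needed to catch subtler capability loss.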

