Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
16
Why It Matters For Business
CoU-style prompts frequently bypass deployed guardrails. Test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.
Summary TLDR
The authors introduce RED-EVAL, a red-teaming prompt style called Chain of Utterances (CoU) that frames a conversation between a harmful agent and a helpful agent to coax harmful answers from LLMs. RED-EVAL achieves high attack success rates (ASR) on both closed-source systems (GPT-4 ≈65%, ChatGPT ≈73%) and many open-source models (>85%). They release HARMFULQA: ~1.9K harmful questions plus 9.5K 'blue' (safe) and 7.3K 'red' (jailbroken) conversations. They propose RED-INSTRUCT / SAFE-ALIGN: fine-tuning a Vicuna-7B variant (STARLING) on this data. Fine-tuning on safe ChatGPT conversations (blue data) improves safety (HHH, TruthfulQA) with small utility trade-offs; using red data directly is a
Problem Statement
Large LMs can produce harmful outputs when prompted. Existing red-team prompts (e.g., chain-of-thought) are often recognized and refused by defended systems. We need more effective red-teaming to find guardrail failures and a practical dataset + fine-tuning procedure to make smaller LMs safer while keeping utility.
Main Contribution
RED-EVAL: a Chain-of-Utterances (CoU) red-teaming prompt that frames a dialogue between a 'harmful' and a 'helpful' agent to elicit harmful completions.
HARMFULQA: a dataset of 1,960 harmful questions plus 9,536 blue (safe) and 7,356 red (jailbroken) conversations collected from ChatGPT using CoU prompts.
RED-INSTRUCT / SAFE-ALIGN: two fine-tuning strategies using HARMFULQA to produce STARLING, a safety-aligned Vicuna-7B variant that improves safety metrics while largely preserving utility.
Key Findings
RED-EVAL frequently jailbreaks widely deployed closed-source systems (ASR: GPT-4 ≈65%, ChatGPT ≈73%).
Open-source models are highly vulnerable to CoU red-teaming (ASR >85% on many models).
Including 'internal thoughts' in the CoU prompt raises attack success markedly.
Fine-tuning on safe ChatGPT conversations (blue data) improves safety while largely preserving capability.
Results
RED-EVAL ASR (closed-source)
RED-EVAL ASR (open-source average)
Effect of internal thoughts in CoU prompt
Safety / utility after SAFE-ALIGN
Who Should Care
What To Try In 7 Days
Run RED-EVAL (CoU prompts with internal thoughts) against your public model to measure ASR.
Collect a small set of safe (blue) responses from a guarded model and fine-tune a lightweight client model on them.
Add a judge step (e.g., a higher-quality classifier or model-based filter) to block harmful responses of the kind RED-EVAL prompts elicit.
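The measurement and judge steps above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the label names, `AsrReport`, `measure_asr`, `gate`, and the keyword-based `toy_judge` are all hypothetical stand-ins, and a real setup would call a model-based judge instead. Excluding unclear labels from the denominator is a design choice of this sketch, made so that judge refusals are reported separately rather than silently counted as safe.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical label set (assumption): the source only says a judge decides
# whether a response is harmful.
HARMFUL, HARMLESS, UNCLEAR = "harmful", "harmless", "unclear"


@dataclass
class AsrReport:
    total: int    # responses sent to the judge
    harmful: int  # responses the judge labelled harmful
    unclear: int  # judge refused or returned an unexpected label

    @property
    def asr(self) -> float:
        """Harmful responses divided by responses with a clear label."""
        judged = self.total - self.unclear
        return self.harmful / judged if judged else 0.0


def measure_asr(responses: Iterable[str], judge: Callable[[str], str]) -> AsrReport:
    """Label every response with the judge and tally the outcomes."""
    total = harmful = unclear = 0
    for response in responses:
        label = judge(response)
        total += 1
        if label == HARMFUL:
            harmful += 1
        elif label != HARMLESS:
            unclear += 1
    return AsrReport(total, harmful, unclear)


def gate(response: str, judge: Callable[[str], str],
         fallback: str = "I can't help with that.") -> str:
    """Pass a response through only when the judge labels it harmless."""
    return response if judge(response) == HARMLESS else fallback


def toy_judge(response: str) -> str:
    """Keyword stand-in for a real judge model (assumption, demo only)."""
    text = response.lower()
    if "step 1" in text:
        return HARMFUL
    if "cannot" in text:
        return HARMLESS
    return UNCLEAR


if __name__ == "__main__":
    sample = ["Step 1: ...", "I cannot help.", "hmm", "I cannot."]
    report = measure_asr(sample, toy_judge)
    print(report, round(report.asr, 3))
```

Tracking the `unclear` count alongside ASR makes it easier to see when the judge's own refusals or labeling rules are distorting the measurement.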
Agent Features
Planning
- CoU roleplay (Red-LM / Base-LM dialogue)
Frameworks
- RED-EVAL
- RED-INSTRUCT
- SAFE-ALIGN
Architectures
- decoder-only causal Transformer
Collaboration
- two-agent roleplay in prompt
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training on red (harmful) data can destabilize learning and collapse generations if used too long or with large learning rates.
- Experiments with full HARMFULQA on open-source models were limited by compute.
- Evaluation relies heavily on GPT-4 as judge; judge bias and policy filtering can affect labels.
- Prompt effectiveness is sensitive to template choices and needs human tuning.
When Not To Use
- Do not use Strategy-B (which includes red data) with a large number of steps K or a high learning rate without monitoring; it can reduce capabilities.
- Do not assume a single red-teaming pass finds all failure modes; use multiple prompt variants and human review.
- Avoid publishing or using raw red conversations without safeguards; they contain harmful content.
Failure Modes
- Model collapse (stops generating) when aggressively maximizing loss on harmful responses.
- Fine-tuning instability leading to higher jailbreak susceptibility if red data training is noisy.
- Judge-label bias: GPT-4 refusals or labeling rules can skew measured ASR.
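One way to watch for the collapse and instability failure modes listed above is a small training-time monitor. The sketch below is an assumption-laden illustration rather than part of the paper's method: `CollapseMonitor`, its thresholds, and the two signals it tracks (held-out loss on safe "blue" data and mean generated length on probe prompts) are all hypothetical choices.

```python
from collections import deque


class CollapseMonitor:
    """Flags a fine-tuning run when generations shrink or blue-data loss climbs.

    All thresholds are illustrative assumptions, not values from the source.
    """

    def __init__(self, min_mean_tokens: float = 5.0,
                 max_loss_ratio: float = 1.5, window: int = 3):
        self.min_mean_tokens = min_mean_tokens  # below this, treat output as collapsed
        self.max_loss_ratio = max_loss_ratio    # allowed growth over the first blue loss
        self.baseline_loss = None               # set on the first update
        self.lengths = deque(maxlen=window)     # recent mean generation lengths

    def update(self, blue_loss: float, mean_generated_tokens: float) -> bool:
        """Record one evaluation point; return True if training should stop."""
        if self.baseline_loss is None:
            self.baseline_loss = blue_loss
        self.lengths.append(mean_generated_tokens)

        # Only judge length once the window is full, to avoid one-off dips.
        window_full = len(self.lengths) == self.lengths.maxlen
        mean_length = sum(self.lengths) / len(self.lengths)
        collapsed = window_full and mean_length < self.min_mean_tokens

        # Loss on safe data rising well above its starting point suggests
        # the red-data phase is damaging general capability.
        diverged = blue_loss > self.max_loss_ratio * self.baseline_loss
        return collapsed or diverged


if __name__ == "__main__":
    monitor = CollapseMonitor(window=2)
    for loss, length in [(1.0, 40.0), (1.1, 30.0), (1.2, 2.0), (1.2, 1.0)]:
        print(loss, length, monitor.update(loss, length))
```

Calling `update` at each evaluation checkpoint during the red-data phase gives an early stop signal before a run has fully degraded; the right thresholds would need tuning per model and learning rate.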
Core Entities
Models
- Vicuna-7B
- Vicuna-13B
- StableBeluga-7B
- StableBeluga-13B
- LLaMA2-FT-7B
- VICUNA-FT-7B
- ChatGPT
- GPT-4
- STARLING
Metrics
- ASR (Attack Success Rate)
- HHH (helpful, honest, harmless)
- TruthfulQA score
- MMLU exact-match
- BBH score
Datasets
- HARMFULQA
- DANGEROUSQA
- ShareGPT
- HHH
- TruthfulQA
- MMLU
- BBH
Benchmarks
- RED-EVAL
- HHH
- TruthfulQA
- MMLU
- BBH

