Overview
Production Readiness: 0.7
Novelty Score: 0.35
Cost Impact Score: 0.45
Citation Count: 44
Why It Matters For Business
Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.
Summary TLDR
Llama Guard is a fine-tuned Llama2-7b model that classifies safety risks in both user prompts and model responses. The authors designed a compact six-category taxonomy (violence, sexual content, criminal planning, guns, substances, self-harm), annotated 13,997 prompt-response pairs, and instruction-tuned the model to output a safe/unsafe label plus any violated categories. Llama Guard matches or beats common moderation APIs on internal and public benchmarks (AUPRC: 0.945 on internal prompts, 0.953 on internal responses; 0.847 on OpenAI Mod zero-shot; 0.626 on ToxicChat zero-shot) and adapts to new policies through zero-shot prompting, few-shot prompting, or light fine-tuning. Weights and code are released.
Problem Statement
Existing content-moderation APIs are rigid: they use fixed taxonomies, provide only API access, and often use small backbones. Products need a customizable, high-quality guardrail that checks both user inputs and model outputs, adapts to different policies, and can be fine-tuned locally.
Main Contribution
A compact safety taxonomy for human-AI conversations covering violence, sexual content, criminal planning, guns, substances, and self-harm.
Llama Guard: an instruction-tuned Llama2-7b model that classifies prompts and responses and lists violated taxonomy categories.
A labeled dataset of 13,997 prompt-response pairs annotated for prompt/response category and safe/unsafe labels.
Evidence that the model adapts to new policies through zero-shot prompting, few-shot prompting, and small amounts of fine-tuning, together with a release of the code and model weights.
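The "safe/unsafe plus violated categories" output described above can be consumed programmatically. The following is a minimal sketch, assuming the model's text output is either `safe` or `unsafe` followed by a comma-separated list of category codes on the next line. The codes `O1` to `O6`, the `Verdict` class, and `parse_verdict` are hypothetical names for illustration; check the released model card for the exact output format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical category codes: the summary names the six categories
# but does not state the codes the model emits.
CATEGORY_NAMES = {
    "O1": "violence",
    "O2": "sexual content",
    "O3": "criminal planning",
    "O4": "guns",
    "O5": "substances",
    "O6": "self-harm",
}


@dataclass
class Verdict:
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse output assumed to look like 'safe' or 'unsafe\\nO1,O3'."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    # Everything after the first line is treated as category codes.
    codes = [c.strip() for c in ",".join(lines[1:]).split(",") if c.strip()]
    # Unknown codes are passed through unchanged rather than dropped.
    return Verdict(safe=False, categories=[CATEGORY_NAMES.get(c, c) for c in codes])
```

Raising on an unexpected label, rather than defaulting to safe, keeps malformed output from silently passing the filter.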
Key Findings
High in-policy classification performance on the internal test set (AUPRC 0.945 on prompts, 0.953 on responses).
Competitive off-policy zero-shot performance on the OpenAI moderation set (AUPRC 0.847).
Outperforms other tools on ToxicChat without extra fine-tuning (AUPRC 0.626, zero-shot).
Few-shot prompting and light fine-tuning speed up adaptation to new taxonomies.
Results
AUPRC (prompt classification): 0.945
AUPRC (response classification): 0.953
AUPRC (OpenAI Mod eval, zero-shot w/ taxonomy): 0.847
AUPRC (ToxicChat, zero-shot w/ taxonomy): 0.626
Training data size: 13,997 prompt-response pairs
Who Should Care
What To Try In 7 Days
Run the released Llama Guard weights on a small sample of your product data to compare labels to current moderation.
Prompt Llama Guard with your policy (zero-shot) and check per-category outputs on edge cases.
Add 2–4 in-context examples per category and re-run to see few-shot gains quickly (<1 hour).
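The zero-shot and few-shot steps above both come down to how the prompt is assembled. The sketch below shows one way to build such a prompt from a policy and optional in-context examples. The template wording, the `O1`, `O2`, ... numbering, and the function `build_guard_prompt` are assumptions for illustration, not the released Llama Guard prompt template; use the template shipped with the weights for real evaluations.

```python
from typing import Dict, List, Sequence, Tuple


def build_guard_prompt(
    policy: Dict[str, str],
    conversation: List[Tuple[str, str]],
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a classification prompt (hypothetical template).

    policy:       category name -> short description of what is disallowed
    conversation: (role, text) turns to be assessed
    examples:     optional (conversation text, expected assessment) pairs
    """
    parts = [
        "Task: decide whether the conversation below is safe or unsafe "
        "under this policy.",
        "",
        "Policy categories:",
    ]
    for i, (name, description) in enumerate(policy.items(), start=1):
        parts.append(f"O{i}: {name}. {description}")
    if examples:  # few-shot: show labeled examples before the real input
        parts += ["", "Examples:"]
        for text, assessment in examples:
            parts.append(f"Conversation: {text}")
            parts.append(f"Assessment: {assessment}")
    parts += ["", "Conversation:"]
    for role, text in conversation:
        parts.append(f"{role}: {text}")
    parts += [
        "",
        "Answer 'safe' or 'unsafe'; if unsafe, list the violated "
        "categories on a second line.",
    ]
    return "\n".join(parts)
```

Calling the function without `examples` gives the zero-shot prompt; passing a few labeled pairs per category gives the few-shot variant, so the two runs differ only in that one argument.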
Optimization Features
Training Optimization
- Instruction fine-tuning on a small, targeted dataset (~1 epoch, 500 steps)
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training and fine-tuning data mostly in English; non-English performance not guaranteed (Sec. 6).
- Dataset is small (13,997 examples) and may not cover all policy edge cases.
- As a full LLM, it can be prompted to generate unsafe completions if used as a chat model.
- Susceptible to prompt injection attacks that may bypass intended behavior.
When Not To Use
- As the only safety layer for free-form chat generation.
- For moderation in languages other than English without further data.
- Where exhaustive policy coverage or legal compliance is required without human review.
Failure Modes
- False negatives on novel or adversarial prompts leading to missed unsafe outputs.
- False positives that block benign user content due to taxonomy mismatch.
- Prompt injection can bypass the intended classification behavior.
- Performance may degrade on out-of-domain phrasing not seen in training.
Core Entities
Models
- Llama2-7b
- Llama Guard (fine-tuned Llama2-7b)
Metrics
- AUPRC
- Precision
- Recall
- F1
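For readers comparing Llama Guard against an existing moderation system on their own data, the metrics listed above can be computed with a few lines of plain Python. This is a minimal sketch: it estimates AUPRC with average precision (the mean of the precision values at each positive item's rank), which is one common estimator and may differ slightly from the procedure used in the paper. The function names are illustrative, labels are assumed to be 1 for unsafe and 0 for safe, and tied scores are ranked in input order.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for binary labels (1 = unsafe)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUPRC estimate: mean precision at the rank of each positive item."""
    # Rank items from highest to lowest "unsafe" score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

AUPRC needs a continuous score per item (for example, the classifier's probability of "unsafe"), while precision, recall, and F1 are computed from hard labels at a single decision threshold.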
Datasets
- Internal annotated dataset (13,997 prompt-response pairs)
- ToxicChat
- OpenAI Moderation Evaluation
- Anthropic harmlessness preference data (seed)
Benchmarks
- ToxicChat
- OpenAI Moderation Evaluation
- Internal test set

