Overview
Llama Guard shows strong AUPRC on internal and public test sets and adapts to new policies via prompting and light fine-tuning, but its dataset and language coverage are limited, and the released model can be misused if it is prompted as a chat model.
Citations: 44
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 70%
Novelty: 35%
Why It Matters For Business
Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.
Summary TLDR
Llama Guard is a fine-tuned Llama2-7b model built to classify safety risks in both user prompts and model responses. The authors designed a small taxonomy (violence, sexual content, criminal planning, guns, substances, self-harm), annotated ~14k prompt/response pairs, and instruction-tuned the model to output safe/unsafe plus the violated categories. Llama Guard matches or beats common moderation APIs on internal and public benchmarks (AUPRC: 0.945 on internal prompts, 0.953 on internal responses; 0.847 on OpenAI Mod zero-shot; 0.626 on ToxicChat zero-shot) and adapts well via zero-shot prompting, few-shot prompting, or light fine-tuning. Weights and code are released.
Problem Statement
Existing content-moderation APIs are rigid: they use fixed taxonomies, provide only API access, and often use small backbones. Products need a customizable, high-quality guardrail that checks both user inputs and model outputs, adapts to different policies, and can be fine-tuned locally.
Main Contribution
A compact safety taxonomy for human-AI conversations covering violence, sexual content, criminal planning, guns, substances, and self-harm.
Llama Guard: an instruction-tuned Llama2-7b model that classifies prompts and responses and lists violated taxonomy categories.
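The safe/unsafe-plus-categories output described above can be consumed with a small parser. The following is a minimal sketch, assuming the model emits `safe` or `unsafe` on the first line and a comma-separated list of category codes on the second; the codes `O1`..`O6`, their mapping to category names, and the names `Verdict` and `parse_verdict` are illustrative assumptions, not taken from the summary.

```python
from dataclasses import dataclass, field

# Hypothetical code-to-name mapping; the real codes depend on the policy
# text placed in the prompt.
CATEGORY_NAMES = {
    "O1": "violence",
    "O2": "sexual content",
    "O3": "criminal planning",
    "O4": "guns",
    "O5": "substances",
    "O6": "self-harm",
}


@dataclass
class Verdict:
    safe: bool
    categories: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse an assumed two-line output: label, then optional category codes."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    codes = []
    if len(lines) > 1:
        codes = [c.strip() for c in lines[1].split(",") if c.strip()]
    # Unknown codes are passed through unchanged rather than dropped.
    return Verdict(safe=False, categories=[CATEGORY_NAMES.get(c, c) for c in codes])
```

Raising on an unexpected label, rather than defaulting to safe, keeps a malformed generation from silently passing the filter.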
Key Findings
High in-policy classification performance on the internal test set (AUPRC 0.945 on prompts, 0.953 on responses).
Competitive off-policy zero-shot performance on the OpenAI moderation set (AUPRC 0.847).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AUPRC (prompt classification) | 0.945 | OpenAI API 0.764 | +0.181 | Internal test set (prompt) | Table 2 reports AUPRC | Table 2 |
| AUPRC (response classification) | 0.953 | OpenAI API 0.769 | +0.184 | Internal test set (response) | Table 2 reports AUPRC | Table 2 |
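For readers who want to reproduce this kind of comparison on their own data, AUPRC is commonly estimated as average precision over examples ranked by score. The sketch below is one such estimator under that assumption; the summary does not say how the authors computed AUPRC, and the function name `average_precision` is illustrative. It also rechecks the two deltas in the table.

```python
def average_precision(labels, scores):
    """Estimate AUPRC as average precision.

    labels: 1 for unsafe (positive), 0 for safe.
    scores: higher means the classifier considers the item more likely unsafe.
    Ties in scores are broken by input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this positive's rank
    return ap / total_pos


# Deltas from the table: Llama Guard AUPRC minus OpenAI API AUPRC.
prompt_delta = round(0.945 - 0.764, 3)    # 0.181
response_delta = round(0.953 - 0.769, 3)  # 0.184
```

A threshold-free metric such as this is useful here because moderation systems being compared may emit scores on different scales.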
What To Try In 7 Days
Run the released Llama Guard weights on a small sample of your product data to compare labels to current moderation.
Prompt Llama Guard with your policy (zero-shot) and check per-category outputs on edge cases.
Add 2–4 in-context examples per category and re-run to see few-shot gains quickly (<1 hour).
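The zero-shot and few-shot steps above amount to placing your policy, and optionally a few labelled examples, in the prompt. The sketch below shows one way to assemble such a prompt; the template wording, the `<BEGIN ...>` markers, and the function name `build_guard_prompt` are hypothetical and are not the official Llama Guard template, so check the released model's documentation for the exact format before use.

```python
def build_guard_prompt(categories, user_message, few_shot=()):
    """Assemble a policy prompt (hypothetical template).

    categories: dict mapping a category code to its description.
    user_message: the text to classify.
    few_shot: iterable of (example_text, example_verdict) pairs.
    """
    lines = [
        "Task: Check whether the message below violates the policy.",
        "",
        "<BEGIN POLICY>",
    ]
    for code, description in categories.items():
        lines.append(f"{code}: {description}")
    lines.append("<END POLICY>")
    for text, verdict in few_shot:
        lines += ["", f"Example message: {text}", f"Example verdict: {verdict}"]
    lines += [
        "",
        "<BEGIN MESSAGE>",
        user_message,
        "<END MESSAGE>",
        "",
        "Answer 'safe' or 'unsafe'. If unsafe, list the violated category "
        "codes on a second line.",
    ]
    return "\n".join(lines)


# Usage: an illustrative two-category policy with one in-context example.
policy = {"O1": "Violence.", "O2": "Self-harm."}
prompt = build_guard_prompt(
    policy,
    "How do I sharpen a kitchen knife?",
    few_shot=[("How do I bake bread?", "safe")],
)
```

Passing an empty `few_shot` gives the zero-shot variant, so the same edge cases can be run both ways and compared.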
Risks & Boundaries
Limitations
Training and fine-tuning data are mostly in English; non-English performance is not guaranteed (Sec. 6).
Dataset is small (13,997 examples) and may not cover all policy edge cases.
When Not To Use
As the only safety layer for free-form chat generation.
For moderation in languages other than English without further data.
Failure Modes
False negatives on novel or adversarial prompts leading to missed unsafe outputs.
False positives that block benign user content due to taxonomy mismatch.
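Both failure modes can be tracked on a labelled sample of your own traffic. The following is a minimal sketch, assuming you have reference labels and classifier predictions as booleans where `True` means unsafe; the function name `error_rates` is illustrative.

```python
def error_rates(reference, predicted):
    """Return false-negative and false-positive rates.

    reference, predicted: equal-length sequences of bools, True = unsafe.
    A false negative is unsafe content labelled safe; a false positive is
    benign content labelled unsafe.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    fn = sum(1 for r, p in pairs if r and not p)
    fp = sum(1 for r, p in pairs if p and not r)
    positives = sum(1 for r in reference if r)
    negatives = len(reference) - positives
    return {
        "false_negative_rate": fn / positives if positives else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }


# Usage: two unsafe and two safe items, one error of each kind.
rates = error_rates([True, True, False, False], [True, False, True, False])
```

Reporting the two rates separately matters because missed unsafe content and blocked benign content usually carry different costs.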

