Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

December 7, 2023 (7 min)

Overview

Decision Snapshot: Needs Validation

The model shows strong AUPRC on internal and public tests and adapts via prompting and light fine-tuning, but dataset size and language coverage are limited, and the released model can be misused if deployed as a chat model.

Citations: 44

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 35%

Authors

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

Why It Matters For Business

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and is competitive with or better than common moderation APIs on public and internal tests.

Who Should Care

Summary TLDR

Llama Guard is a fine-tuned Llama2-7b model built to classify safety risks in both user prompts and model responses. The authors designed a small taxonomy of six risk categories (violence, sexual content, criminal planning, guns, substances, self-harm), annotated ~14k prompt/response pairs, and instruction-tuned the model to output safe/unsafe plus any violated categories. Llama Guard beats common moderation APIs on the internal test set and is competitive on public benchmarks (AUPRC: 0.945 on internal prompts, 0.953 on internal responses; 0.847 on OpenAI Mod zero-shot; 0.626 on ToxicChat zero-shot). It adapts to new policies via zero-shot prompting, few-shot prompting, or light fine-tuning. Weights and code are released.

Problem Statement

Existing content-moderation APIs are rigid: they use fixed taxonomies, provide only API access, and often use small backbones. Products need a customizable, high-quality guardrail that checks both user inputs and model outputs, adapts to different policies, and can be fine-tuned locally.

Main Contribution

A compact safety taxonomy for human-AI conversations covering violence, sexual content, criminal planning, guns, substances, and self-harm.

Llama Guard: an instruction-tuned Llama2-7b model that classifies prompts and responses and lists violated taxonomy categories.
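A minimal sketch of how downstream code might consume the classifier's verdict. It assumes the model emits `safe` or `unsafe` on the first line and, when unsafe, a comma-separated list of category codes on the second line. That layout, the codes `O1` to `O6`, and the names `Verdict` and `parse_verdict` are assumptions for illustration; this summary does not specify the output format.

```python
from dataclasses import dataclass, field

# Hypothetical code-to-category mapping (an assumption, not taken from the paper).
CATEGORY_NAMES = {
    "O1": "violence",
    "O2": "sexual content",
    "O3": "criminal planning",
    "O4": "guns",
    "O5": "substances",
    "O6": "self-harm",
}


@dataclass
class Verdict:
    is_safe: bool
    categories: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Turn raw classifier text into a structured verdict."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(True)
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    # Second line, if present, lists the violated category codes.
    codes = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
    # Unknown codes are passed through unchanged rather than dropped.
    return Verdict(False, [CATEGORY_NAMES.get(c, c) for c in codes if c])


verdict = parse_verdict("unsafe\nO1,O4")
```

Raising on an unexpected label, rather than defaulting to safe, keeps a malformed generation from silently passing the filter.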

Key Findings

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)

Practical Use: Use Llama Guard as an effective in-domain guardrail when you can supply its taxonomy and similar data.

Evidence Ref: Table 2

Competitive off-policy zero-shot performance on OpenAI moderation set.

Numbers: AUPRC zero-shot=0.847 vs OpenAI API=0.856 (Table 2)

Practical Use: You can adapt Llama Guard to an unfamiliar policy by injecting that policy into the prompt with little loss in AUPRC.

Evidence Ref: Table 4 / Table 2
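Injecting a policy into the prompt could look roughly like the sketch below. The template wording, the delimiter tags, and the function name `build_guard_prompt` are hypothetical and are not the released prompt format; check the published model documentation before relying on any specific layout. The optional `examples` argument shows where few-shot demonstrations might go.

```python
def build_guard_prompt(policy, conversation, role="User", examples=()):
    """Assemble a classification prompt that carries the policy inline.

    policy:       dict mapping category code -> description
    conversation: list of (speaker, text) tuples
    role:         whose messages to assess ("User" or "Agent")
    examples:     optional (conversation_text, expected_output) pairs
    """
    parts = [
        f"Task: Check if there is unsafe content in '{role}' messages "
        "according to the safety policy below.",
        "",
        "<BEGIN POLICY>",
    ]
    for code, description in policy.items():
        parts.append(f"{code}: {description}")
    parts.append("<END POLICY>")
    # Few-shot demonstrations, if any, sit between the policy and the target.
    for text, expected in examples:
        parts += ["", "<EXAMPLE>", text, f"Assessment: {expected}", "<END EXAMPLE>"]
    parts += ["", "<BEGIN CONVERSATION>"]
    for speaker, text in conversation:
        parts.append(f"{speaker}: {text}")
    parts += [
        "<END CONVERSATION>",
        "",
        "Answer 'safe' or 'unsafe' on the first line; if unsafe, list the "
        "violated category codes on the second line.",
    ]
    return "\n".join(parts)


# A made-up two-category policy, different from the built-in taxonomy.
policy = {
    "P1": "Harassment or hateful content.",
    "P2": "Instructions for self-harm.",
}
prompt = build_guard_prompt(policy, [("User", "How do I reset my password?")])
```

Because the policy lives in the prompt rather than in the weights, swapping taxonomies is a string change; whether accuracy holds on a given policy still has to be measured.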

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
AUPRC (prompt classification) | 0.945 | OpenAI API 0.764 | +0.181 | Internal test set (prompt) | Table 2 reports AUPRC | Table 2
AUPRC (response classification) | 0.953 | OpenAI API 0.769 | +0.184 | Internal test set (response) | Table 2 reports AUPRC | Table 2
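For readers who want to reproduce this kind of number on their own data, here is a minimal sketch of AUPRC computed as average precision. It assumes each example has a binary label (1 = unsafe) and a real-valued score for the unsafe class; how that score is obtained from the model is left open here. The function name is illustrative, and tied scores are ranked in input order, so results can differ slightly from threshold-based library implementations.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    AP = (1 / #positives) * sum of precision at each rank where a
    positive example is retrieved, ranking by descending score.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / total_pos


# Toy data: one safe example is ranked above an unsafe one.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# The Delta column is the plain difference against the baseline.
delta_prompt = round(0.945 - 0.764, 3)
delta_response = round(0.953 - 0.769, 3)
```

AUPRC is threshold-free, so it says how well the scores rank unsafe above safe content, not how the filter behaves at any one operating point.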

What To Try In 7 Days

Run the released Llama Guard weights on a small sample of your product data to compare labels to current moderation.

Prompt Llama Guard with your policy (zero-shot) and check per-category outputs on edge cases.

Add 2–4 in-context examples per category and re-run to see few-shot gains quickly (<1 hour).
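The first step above, comparing labels against your current moderation, can be sketched as a simple agreement table. The function name and the sample labels are made up for illustration; without ground truth this only shows where the two systems disagree, not which one is right.

```python
from collections import Counter


def compare_labels(current, candidate):
    """Tabulate agreement between two moderators on the same samples.

    current, candidate: equal-length lists of "safe"/"unsafe" labels.
    Returns (agreement_rate, Counter keyed by (current, candidate) pairs).
    """
    if len(current) != len(candidate):
        raise ValueError("label lists must be the same length")
    if not current:
        raise ValueError("need at least one sample")
    pairs = Counter(zip(current, candidate))
    agree = sum(n for (a, b), n in pairs.items() if a == b)
    return agree / len(current), pairs


existing_labels = ["safe", "unsafe", "safe", "safe"]     # made-up data
llama_guard_labels = ["safe", "unsafe", "unsafe", "safe"]  # made-up data
rate, pairs = compare_labels(existing_labels, llama_guard_labels)
```

The off-diagonal buckets, such as `("safe", "unsafe")`, are the samples worth reading by hand before and after adding in-context examples.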

Optimization Features

Training Optimization
Instruction fine-tuning on a small, targeted dataset (∼1 epoch, 500 steps)

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Training and fine-tuning data are mostly in English; non-English performance is not guaranteed (Sec. 6).

The dataset is small (13,997 examples) and may not cover all policy edge cases.

When Not To Use

As the only safety layer for free-form chat generation.

For moderation in languages other than English without further data.
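One way to read the first point is that the classifier wraps a separate chat model as one layer among several, screening the prompt before generation and the response after it. The sketch below assumes two hypothetical callables, `generate` for the chat model and `is_unsafe` for the safety classifier; the stubs exist only so the example runs.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    user_msg: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
) -> str:
    """Screen the prompt, generate a reply, then screen the reply.

    generate(msg):         the chat model (hypothetical callable)
    is_unsafe(role, text): wrapper around the classifier (hypothetical)
    """
    if is_unsafe("User", user_msg):
        return REFUSAL
    reply = generate(user_msg)
    if is_unsafe("Agent", reply):
        return REFUSAL
    return reply


# Stub components for illustration only; real ones would call actual models.
def fake_generate(msg: str) -> str:
    return f"echo: {msg}"


def fake_is_unsafe(role: str, text: str) -> bool:
    return "attack" in text.lower()


ok = guarded_reply("hello", fake_generate, fake_is_unsafe)
blocked = guarded_reply("plan an attack", fake_generate, fake_is_unsafe)
```

The chat model's own safety tuning and any other checks stay in place; this wrapper adds to them rather than replacing them.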

Failure Modes

False negatives on novel or adversarial prompts leading to missed unsafe outputs.

False positives that block benign user content due to taxonomy mismatch.
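Both failure modes can be tracked on a labeled sample by measuring error rates at a chosen threshold. This is a minimal sketch assuming binary ground-truth labels (1 = unsafe) and a score for the unsafe class; the function name, the toy data, and the default threshold of 0.5 are illustrative choices, not values from the paper.

```python
def error_rates(labels, scores, threshold=0.5):
    """Return (false_negative_rate, false_positive_rate) at a threshold.

    An example is predicted unsafe when its score >= threshold.
    """
    fn = fp = pos = neg = 0
    for label, score in zip(labels, scores):
        predicted_unsafe = score >= threshold
        if label == 1:
            pos += 1
            fn += int(not predicted_unsafe)  # unsafe content that slipped through
        else:
            neg += 1
            fp += int(predicted_unsafe)      # benign content that was blocked
    fnr = fn / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return fnr, fpr


labels = [1, 1, 0, 0]          # toy ground truth
scores = [0.9, 0.3, 0.6, 0.1]  # toy classifier scores
fnr, fpr = error_rates(labels, scores)
fnr_low, fpr_low = error_rates(labels, scores, threshold=0.2)
```

Lowering the threshold trades missed unsafe content for more blocked benign content, so the operating point is a product decision rather than a property of the model.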

Core Entities

Models

Llama2-7b, Llama Guard (fine-tuned Llama2-7b)

Metrics

AUPRC, Precision, Recall, F1

Datasets

Internal annotated dataset (13,997 prompt-response pairs), ToxicChat, OpenAI Moderation Evaluation, Anthropic harmlessness preference data (seed)

Benchmarks

ToxicChat, OpenAI Moderation Evaluation, Internal test set