Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

Overview

Decision Snapshot: Needs Validation

Model shows strong AUPRC on internal and public tests and is adaptable via prompting and light fine-tuning, but dataset and language coverage are limited and the released model can be abused if used as a chat model.

Citations: 44

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 35%

Authors

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

Links

Abstract / PDF / Code

Why It Matters For Business

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

Llama Guard is a fine-tuned Llama2-7b model built to classify safety risks in both user prompts and model responses. The authors designed a small taxonomy of six risk categories (violence, sexual content, criminal planning, guns, substances, self-harm), annotated ~14k prompt/response pairs, and instruction-tuned the model to output a safe/unsafe label plus any violated categories. Llama Guard matches or comes close to common moderation APIs on internal and public benchmarks (AUPRC: 0.945 on internal prompts, 0.953 on internal responses; 0.847 zero-shot on the OpenAI moderation set; 0.626 zero-shot on ToxicChat) and adapts via zero-shot prompting, few-shot prompting, or light fine-tuning. Weights and code are released.

Problem Statement

Existing content-moderation APIs are rigid: they use fixed taxonomies, provide only API access, and often use small backbones. Products need a customizable, high-quality guardrail that checks both user inputs and model outputs, adapts to different policies, and can be fine-tuned locally.

Main Contribution

A compact safety taxonomy for human-AI conversations covering violence, sexual content, criminal planning, guns, substances, and self-harm.

Llama Guard: an instruction-tuned Llama2-7b model that classifies prompts and responses and lists violated taxonomy categories.
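
Because the model returns its decision as generated text, downstream code has to turn that text into a structured verdict. The sketch below assumes the output is a first line reading "safe" or "unsafe", optionally followed by a line of comma-separated category codes such as "O1,O3". That layout and the names Verdict and parse_verdict are illustrative assumptions, not something this summary specifies; check the released model card for the actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Parsed classifier decision: a safe/unsafe flag plus violated category codes."""
    unsafe: bool
    categories: list[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse text of the ASSUMED form 'safe' or 'unsafe' + newline + 'O1,O3'."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(unsafe=False)
    if label != "unsafe":
        # Fail loudly rather than silently treating unknown text as safe.
        raise ValueError(f"unexpected label: {lines[0]!r}")
    # Category codes are assumed to be comma-separated on the second line.
    codes = []
    if len(lines) > 1:
        codes = [c.strip() for c in lines[1].split(",") if c.strip()]
    return Verdict(unsafe=True, categories=codes)
```

Raising on an unrecognized label is a deliberate choice here: a guardrail that defaults to "safe" on malformed output would hide exactly the failures it is meant to catch.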

Key Findings

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)

Practical Use: Use Llama Guard as an effective in-domain guardrail when you can supply its taxonomy and similar data.

Evidence Ref: Table 2

Competitive off-policy zero-shot performance on OpenAI moderation set.

Numbers: AUPRC zero-shot=0.847 vs OpenAI API=0.856 (Table 2)

Practical Use: You can adapt Llama Guard to an unfamiliar policy by injecting that policy into the prompt with little loss in AUPRC.

Evidence Ref: Table 4 / Table 2
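
Injecting a policy into the prompt amounts to rendering your own category list and the conversation into one instruction string. The following is a minimal sketch of that idea. The template wording, the "O1", "O2" numbering, and the function name build_policy_prompt are all assumptions made for illustration; they are not the released prompt, which should be taken from the official repository.

```python
def build_policy_prompt(
    categories: dict[str, str],
    conversation: list[tuple[str, str]],
    target_role: str = "User",
) -> str:
    """Render a custom policy and a conversation into one classification prompt.

    categories:   category name -> short description (insertion order is kept).
    conversation: (role, text) turns, oldest first.
    target_role:  whose messages should be assessed ("User" or "Agent").
    NOTE: the template below is hypothetical, not the official one.
    """
    policy = "\n".join(
        f"O{i}: {name}.\n{description}"
        for i, (name, description) in enumerate(categories.items(), start=1)
    )
    dialogue = "\n\n".join(f"{role}: {text}" for role, text in conversation)
    return (
        f"Task: Check if there is unsafe content in '{target_role}' messages "
        "in the conversation below according to our safety policy.\n\n"
        f"<BEGIN UNSAFE CONTENT CATEGORIES>\n{policy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n\n{dialogue}\n\n<END CONVERSATION>\n\n"
        "Provide your safety assessment: the first line must be 'safe' or "
        "'unsafe'; if unsafe, list the violated categories on a second line."
    )


prompt = build_policy_prompt(
    {"Financial advice": "Should not give individualized investment advice."},
    [("User", "Which stock should I put my savings in?")],
)
```

Keeping the policy as data rather than hard-coded text makes it easy to swap taxonomies and compare zero-shot results across them.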

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
AUPRC (prompt classification)	0.945	OpenAI API 0.764	+0.181	Internal test set (prompt)	Table 2 reports AUPRC	Table 2
AUPRC (response classification)	0.953	OpenAI API 0.769	+0.184	Internal test set (response)	Table 2 reports AUPRC	Table 2
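
For readers who want to reproduce this kind of comparison on their own data, AUPRC can be estimated as average precision: the mean of the precision values taken at each true positive when examples are ranked by score. This is a minimal pure-Python sketch of that common estimator. Whether the paper used this exact procedure is not stated in this summary, and the function name and toy data are assumptions. Tied scores are not given special handling.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Estimate AUPRC as average precision.

    labels: 1 for unsafe (positive), 0 for safe.
    scores: model's unsafe score per example, higher means more likely unsafe.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from most to least likely unsafe.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / total_positives


# Toy data (hypothetical): positives land at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# The Delta column is the value minus the baseline, per the table above.
delta_prompt = round(0.945 - 0.764, 3)
delta_response = round(0.953 - 0.769, 3)
```

On the toy data the two positives contribute precisions of 1/1 and 2/3, so the estimate is their mean, 5/6.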

What To Try In 7 Days

Run the released Llama Guard weights on a small sample of your product data to compare labels to current moderation.

Prompt Llama Guard with your policy (zero-shot) and check per-category outputs on edge cases.

Add 2–4 in-context examples per category and re-run to see few-shot gains quickly (<1 hour).
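
The first step above, comparing Llama Guard labels with your current moderation labels on a sample, can be sketched as a simple agreement report. The function name compare_labels and the input shape (two parallel lists of booleans, True meaning flagged unsafe) are assumptions; how you obtain each list depends on your own pipeline.

```python
from collections import Counter


def compare_labels(guard: list[bool], current: list[bool]) -> dict:
    """Summarize where two moderation systems agree and disagree.

    guard:   True where Llama Guard flagged the item as unsafe.
    current: True where the existing moderation flagged the same item.
    """
    if len(guard) != len(current):
        raise ValueError("both label lists must cover the same items")
    if not guard:
        raise ValueError("need at least one item")
    counts = Counter(zip(guard, current))
    agree = counts[(True, True)] + counts[(False, False)]
    return {
        "agreement_rate": agree / len(guard),
        "only_guard_flagged": counts[(True, False)],
        "only_current_flagged": counts[(False, True)],
        # Indices to review by hand; disagreements are where policies differ.
        "disagreement_indices": [
            i for i, (g, c) in enumerate(zip(guard, current)) if g != c
        ],
    }


report = compare_labels(
    [True, False, True, False],
    [True, False, False, True],
)
```

Reading the disagreement items by hand is usually more informative than the agreement rate alone, because it shows whether the gap comes from taxonomy differences or from model errors.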

Optimization Features

Training Optimization

Instruction fine-tuning on a small, targeted dataset (∼1 epoch, 500 steps)

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard

Risks & Boundaries

Limitations

Training and fine-tuning data mostly in English; non-English performance not guaranteed (Sec.6).

Dataset is small (13,997 examples) and may not cover all policy edge cases.

When Not To Use

As the only safety layer for free-form chat generation.

For moderation in languages other than English without further data.

Failure Modes

False negatives on novel or adversarial prompts leading to missed unsafe outputs.

False positives that block benign user content due to taxonomy mismatch.
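
Both failure modes can be tracked on a human-labeled sample. This minimal sketch computes the false negative rate (unsafe items the filter missed) and the false positive rate (benign items it blocked). The function name error_rates and the boolean-list inputs are assumptions for illustration.

```python
def error_rates(predicted_unsafe: list[bool], truly_unsafe: list[bool]) -> dict:
    """Compute false negative and false positive rates against human labels.

    predicted_unsafe: True where the filter flagged the item.
    truly_unsafe:     True where a human reviewer judged the item unsafe.
    A rate is None when its denominator is zero.
    """
    if len(predicted_unsafe) != len(truly_unsafe):
        raise ValueError("both lists must cover the same items")
    pairs = list(zip(predicted_unsafe, truly_unsafe))
    false_negatives = sum(1 for pred, truth in pairs if truth and not pred)
    false_positives = sum(1 for pred, truth in pairs if pred and not truth)
    positives = sum(1 for _, truth in pairs if truth)
    negatives = len(pairs) - positives
    return {
        "false_negative_rate": false_negatives / positives if positives else None,
        "false_positive_rate": false_positives / negatives if negatives else None,
    }


rates = error_rates(
    [True, False, False, True],
    [True, True, False, False],
)
```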

Core Entities

Models

Llama2-7b, Llama Guard (fine-tuned Llama2-7b)

Metrics

AUPRC, Precision, Recall, F1

Datasets

Internal annotated dataset (13,997 prompt-response pairs), ToxicChat, OpenAI Moderation Evaluation, Anthropic harmlessness preference data (seed)

Benchmarks

ToxicChat, OpenAI Moderation Evaluation, Internal test set
