Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

December 7, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.35

Cost Impact Score

0.45

Citation Count

44

Authors

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

Links

Abstract / PDF

Why It Matters For Business

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Summary TLDR

Llama Guard is a fine-tuned Llama2-7b model built to classify safety risks in both user prompts and model responses. The authors designed a small six-category taxonomy (violence, sexual content, criminal planning, guns, substances, self-harm), annotated 13,997 prompt/response pairs, and instruction-tuned the model to output a safe/unsafe verdict plus any violated categories. On the internal test set it reaches AUPRC 0.945 on prompts and 0.953 on responses, well ahead of the OpenAI API (0.764 and 0.769). Zero-shot on public benchmarks it scores 0.847 on the OpenAI Moderation set (just below the OpenAI API's 0.856) and 0.626 on ToxicChat (above the OpenAI API's 0.588). It adapts to new policies via zero-shot prompting, few-shot prompting, or light fine-tuning. Weights and code are released.

Problem Statement

Existing content-moderation APIs are rigid: they use fixed taxonomies, provide only API access, and often use small backbones. Products need a customizable, high-quality guardrail that checks both user inputs and model outputs, adapts to different policies, and can be fine-tuned locally.
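
A guardrail that checks both user inputs and model outputs can be pictured as two classification calls wrapped around generation. The following is a minimal sketch, not the authors' implementation: `guarded_chat`, `GuardedReply`, and the three callables (`generate`, `is_unsafe_prompt`, `is_unsafe_response`) are hypothetical names standing in for whatever chat model and safety classifier a product actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardedReply:
    text: str
    blocked_at: Optional[str]  # "prompt", "response", or None if nothing was blocked


def guarded_chat(
    prompt: str,
    generate: Callable[[str], str],                 # hypothetical chat model
    is_unsafe_prompt: Callable[[str], bool],        # hypothetical input classifier
    is_unsafe_response: Callable[[str, str], bool], # hypothetical output classifier
    refusal: str = "Sorry, I can't help with that.",
) -> GuardedReply:
    # Check the user input before spending any compute on generation.
    if is_unsafe_prompt(prompt):
        return GuardedReply(refusal, "prompt")
    reply = generate(prompt)
    # Check the model output in the context of the prompt that produced it.
    if is_unsafe_response(prompt, reply):
        return GuardedReply(refusal, "response")
    return GuardedReply(reply, None)
```

Keeping the two checks as separate callables means the input policy and the output policy can differ, which matches the idea of classifying prompts and responses as distinct tasks.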

Main Contribution

A compact safety taxonomy for human-AI conversations covering violence, sexual content, criminal planning, guns, substances, and self-harm.

Llama Guard: an instruction-tuned Llama2-7b model that classifies prompts and responses and lists violated taxonomy categories.

A labeled dataset of 13,997 prompt-response pairs annotated for prompt/response category and safe/unsafe labels.

Demonstrations that the model adapts to new taxonomies via zero-shot prompting, few-shot prompting, and small amounts of fine-tuning, together with a release of the code and model weights.
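
Because the model returns a safe/unsafe verdict plus the violated categories as generated text, downstream code needs to parse that text. The sketch below assumes a particular output layout that is not spelled out in this summary: the verdict on the first line and, when unsafe, comma-separated category codes on the second line. `Verdict` and `parse_verdict` are illustrative names, and the layout should be checked against the released model before relying on it.

```python
from typing import NamedTuple, Tuple


class Verdict(NamedTuple):
    unsafe: bool
    categories: Tuple[str, ...]


def parse_verdict(raw: str) -> Verdict:
    """Parse classifier text, assuming 'safe' or 'unsafe' on line 1 and
    comma-separated category codes on line 2 (an assumed layout)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(False, ())
    if label == "unsafe":
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(True, tuple(c.strip() for c in codes if c.strip()))
    # Anything else is treated as an error rather than silently passed through.
    raise ValueError(f"unexpected label: {lines[0]!r}")
```

Raising on an unrecognized label is a deliberate choice here: a safety filter that quietly treats malformed output as "safe" would fail open.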

Key Findings

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)

Competitive off-policy zero-shot performance on OpenAI moderation set.

Numbers: AUPRC zero-shot=0.847 vs OpenAI API=0.856 (Table 2)

Outperforms other tools on ToxicChat without extra fine-tuning.

Numbers: AUPRC Llama Guard=0.626 vs OpenAI=0.588 vs Perspective=0.532 (Table 2)

Few-shot and small fine-tuning speed up adaptation to new taxonomies.

Numbers: Few-shot AUPRC on OpenAI Mod=0.872 (vs zero-shot 0.847); Llama Guard fine-tuned on ~20% of ToxicChat matches Llama2-7b fine-tuned on 100% (Fig. 3, Sec. 4.5)
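
Zero-shot and few-shot adaptation both come down to changing the prompt: the target policy is written into the instruction, and few-shot adds a handful of labeled conversations. The sketch below illustrates that idea with a made-up template. The function name, the `<BEGIN ...>` markers, and the `C1`, `C2` category codes are assumptions for illustration, not the template shipped with the released model.

```python
from typing import Mapping, Sequence, Tuple

Turn = Tuple[str, str]  # (role, text), e.g. ("User", "hello")


def _format_turns(turns: Sequence[Turn]) -> str:
    return "\n".join(f"{role}: {text}" for role, text in turns)


def build_guard_prompt(
    policy: Mapping[str, str],                      # category name -> description
    conversation: Sequence[Turn],
    examples: Sequence[Tuple[Sequence[Turn], str]] = (),  # (turns, verdict text)
) -> str:
    parts = [
        "Task: Check whether the conversation below contains unsafe content "
        "according to the following policy categories.",
        "",
        "<BEGIN POLICY>",
    ]
    # Number the categories so the verdict can refer to them by short code.
    for i, (name, desc) in enumerate(policy.items(), start=1):
        parts.append(f"C{i}: {name}. {desc}")
    parts.append("<END POLICY>")
    # Few-shot: each labeled example is shown before the conversation to judge.
    for turns, verdict in examples:
        parts += ["", "<BEGIN EXAMPLE>", _format_turns(turns),
                  f"Assessment: {verdict}", "<END EXAMPLE>"]
    parts += [
        "",
        "<BEGIN CONVERSATION>",
        _format_turns(conversation),
        "<END CONVERSATION>",
        "",
        "Answer 'safe' or 'unsafe'; if unsafe, list the violated category "
        "codes on a second line.",
    ]
    return "\n".join(parts)
```

Calling it with an empty `examples` gives the zero-shot prompt; passing a few labeled conversations gives the few-shot variant, so the two settings can be compared on the same evaluation sample.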

Results

AUPRC (prompt classification)

Value: 0.945

Baseline: OpenAI API 0.764

AUPRC (response classification)

Value: 0.953

Baseline: OpenAI API 0.769

AUPRC (OpenAI Mod eval, zero-shot w/ taxonomy)

Value: 0.847

Baseline: OpenAI Mod API 0.856

AUPRC (ToxicChat, zero-shot w/ taxonomy)

Value: 0.626

Baseline: OpenAI API 0.588

Training data size

Value: 13,997 examples

Who Should Care

What To Try In 7 Days

Run the released Llama Guard weights on a small sample of your product data to compare labels to current moderation.

Prompt Llama Guard with your policy (zero-shot) and check per-category outputs on edge cases.

Add 2–4 in-context examples per category and re-run to see few-shot gains quickly (<1 hour).
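
For the first step in this list, comparing Llama Guard's labels with an existing moderation system, a simple agreement breakdown is usually enough to find the cases worth reading by hand. This is a minimal sketch under the assumption that both systems' outputs have already been reduced to one boolean per item (True meaning "flagged unsafe"); `compare_labels` and its dictionary keys are illustrative names.

```python
from collections import Counter
from typing import Dict, Sequence, Union


def compare_labels(
    current: Sequence[bool], candidate: Sequence[bool]
) -> Dict[str, Union[int, float]]:
    """Summarize where two moderation systems agree and disagree.

    `current` holds the existing system's flags, `candidate` the new
    classifier's flags, in the same item order.
    """
    if len(current) != len(candidate):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(current, candidate))
    n = len(current)
    agree = counts[(True, True)] + counts[(False, False)]
    return {
        "agreement": agree / n if n else 0.0,
        "both_flag": counts[(True, True)],
        "only_current_flags": counts[(True, False)],
        "only_candidate_flags": counts[(False, True)],
        "neither_flags": counts[(False, False)],
    }
```

The two "only" buckets are the ones to review manually: they show where the systems' policies or error patterns diverge, which a single agreement number hides.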

Optimization Features

Training Optimization

  • Instruction fine-tuning on a small, targeted dataset (∼1 epoch, 500 steps)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training and fine-tuning data mostly in English; non-English performance not guaranteed (Sec.6).
  • Dataset is small (13,997 examples) and may not cover all policy edge cases.
  • As a full LLM, it can be prompted to generate unsafe completions if used as a chat model.
  • Susceptible to prompt injection attacks that may bypass intended behavior.

When Not To Use

  • As the only safety layer for free-form chat generation.
  • For moderation in languages other than English without further data.
  • Where exhaustive policy coverage or legal compliance is required without human review.

Failure Modes

  • False negatives on novel or adversarial prompts leading to missed unsafe outputs.
  • False positives that block benign user content due to taxonomy mismatch.
  • Prompt injection bypasses can change model behavior.
  • Performance may degrade on out-of-domain phrasing not seen in training.

Core Entities

Models

  • Llama2-7b
  • Llama Guard (fine-tuned Llama2-7b)

Metrics

  • AUPRC
  • Precision
  • Recall
  • F1
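
AUPRC, the headline metric in this summary, is the area under the precision-recall curve obtained by sweeping a threshold over the classifier's unsafe score. One common estimator is step-wise average precision, sketched below in plain Python. Whether the authors used exactly this estimator is an assumption, and `average_precision` is an illustrative name.

```python
from typing import Sequence


def average_precision(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Step-wise AUPRC: sum over thresholds of (recall gain) * precision.

    `labels` are ground truth (True = unsafe); `scores` are the classifier's
    unsafe scores, higher meaning more likely unsafe.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    ap, tp, i = 0.0, 0, 0
    while i < len(pairs):
        # Items with tied scores cross the threshold together.
        j, tp_group = i, 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp_group += 1 if pairs[j][1] else 0
            j += 1
        tp += tp_group
        precision = tp / j                      # j items flagged at this threshold
        ap += (tp_group / total_pos) * precision
        i = j
    return ap
```

Unlike accuracy, this metric does not depend on picking a single threshold, which is why it suits comparisons between classifiers whose scores are calibrated differently.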

Datasets

  • Internal annotated dataset (13,997 prompt-response pairs)
  • ToxicChat
  • OpenAI Moderation Evaluation
  • Anthropic harmlessness preference data (seed)

Benchmarks

  • ToxicChat
  • OpenAI Moderation Evaluation
  • Internal test set