Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.
Key finding
Llama Guard achieves high in-policy classification performance on the internal test set.
Numbers: AUPRC (area under the precision-recall curve) prompt=0.945; response=0.953 (Table 2)
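
For readers unfamiliar with the metric, AUPRC is the area under the precision-recall curve, and it is often estimated step-wise as average precision. The sketch below shows that estimate under stated assumptions: binary labels (1 = unsafe, 0 = safe), one classifier score per example, and no tied scores. The function name `average_precision` and the toy data are hypothetical illustrations. They are not taken from the Llama Guard evaluation and may differ from how Table 2 was computed.

```python
def average_precision(labels, scores):
    """Step-wise estimate of AUPRC: mean precision at each positive's rank.

    labels: list of 0/1 ground-truth values (1 = positive class).
    scores: list of classifier scores, higher means more likely positive.
    Assumes scores are distinct; ties are ordered arbitrarily by the sort.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` examples.
            precision_sum += hits / rank
    return precision_sum / n_pos


# Hypothetical toy data, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
ap = average_precision(labels, scores)
print(round(ap, 3))
```

A value of 1.0 means every positive example is ranked above every negative one, so scores near 0.95 indicate that the classifier separates the two classes well across thresholds.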

