Overview
The benchmark is practical and actionable for safety audits, but it relies on GPT-4 as a judge and on jailbreak prompts, both of which can bias results; treat scores as diagnostic, not definitive.
Citations: 4
Evidence Strength: 0.65
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 55%
Why It Matters For Business
FFT shows that models can spread wrong facts, make biased decisions, or look safe on isolated utterances while producing toxic output in context; companies should test models for factual errors and context-aware toxicity before using them in products.
Summary TLDR
The authors release FFT, a 2,116-example benchmark that probes three harm dimensions of LLM outputs: factuality (misinformation and counterfacts), fairness (identity preference, credit/criminal/health decisions across 17 identities), and toxicity (utterance and context-level using jailbreak prompts). They evaluate 9 models (GPT-4, GPT-3.5, Llama2 variants, Vicuna) and find substantial gaps: factuality and counterfact handling are weak, fairness varies by model and identity, and context-aware toxicity is worse than literal toxicity. They show supervised fine-tuning (SFT) and RLHF help, and they provide templates and data at the FFT GitHub.
Problem Statement
Current LLM harmlessness tests focus mostly on toxic language. Real harm also comes from factual errors and biased decisions. We need a compact, practical benchmark that tests whether models refuse, correct, or safely handle misleading facts, identity-sensitive predictions, and context-sensitive toxicity.
Main Contribution
FFT benchmark: 2,116 curated queries covering factuality (misinformation + counterfacts), fairness (identity preference, credit/criminal/health), and toxicity (utterance and context).
Evaluation of 9 popular LLMs (closed and open) under the same prompts and metrics, with automatic scoring and human-style checks.
Key Findings
Factuality is weak, especially on counterfactual prompts.
Fairness varies by model; GPT-4 shows lower disparity than many open-source models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 0.54; GPT-3.5 0.548; Llama2-chat-13B 0.55; Llama2-chat-70B 0.547 | — | — | FFT factuality (misinformation + counterfacts) | Table 4 reports per-model accuracy across True-False, open-ended, and counterfacts | Table 4 |
| Fairness overall CV (lower better) | GPT-4 0.483; GPT-3.5 0.468; Llama2-chat-70B 0.888; Llama2-chat-13B 0.767 | — | — | FFT fairness (gender, race, religion tasks) | Table 6 lists coefficient of variation per identity group and overall | Table 6 |
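
The fairness metric in the table is a coefficient of variation (CV): the standard deviation of a per-group quantity divided by its mean, so lower values mean more uniform treatment across identities. A minimal sketch follows, assuming the per-group quantity is a favorable-decision rate and using the population standard deviation; the exact quantity and estimator used in FFT are not stated here, and the group names and rates are made up for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates):
    """Return population std / mean for a collection of per-group rates."""
    values = list(rates)
    avg = mean(values)
    if avg == 0:
        raise ValueError("CV is undefined when the mean rate is zero")
    return pstdev(values) / avg


# Hypothetical per-identity favorable-decision rates (illustrative only,
# not taken from the FFT paper).
group_rates = {"group_a": 0.50, "group_b": 0.40, "group_c": 0.60}

cv = coefficient_of_variation(group_rates.values())
print(f"CV across groups: {cv:.3f}")
```

A CV of 0 means every group receives the same rate; swapping `pstdev` for `stdev` gives the sample-based variant if that matches your reporting convention better.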
What To Try In 7 Days
Run FFT or a subset on your model to surface counterfactual and context-toxicity failures
Add a refusal/uncertainty layer for factual claims and route high-risk answers to verification
Measure fairness with CV across sensitive groups on your application tasks and log disparities
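
The second item above, a refusal/uncertainty layer that routes risky answers to verification, could start as simply as the sketch below. Everything in it is an assumption for illustration: the `route_answer` helper, the `HEDGE_MARKERS` phrase list, and the `is_high_risk` flag (which your application would have to supply) are hypothetical and not part of FFT.

```python
from dataclasses import dataclass

# Hypothetical phrases that suggest the model is unsure of a factual claim.
HEDGE_MARKERS = ("i'm not sure", "i am not sure", "cannot verify", "i don't know")


@dataclass
class Routed:
    answer: str
    route: str  # "deliver" or "verify"


def route_answer(answer: str, is_high_risk: bool) -> Routed:
    """Send hedged or high-risk answers to verification; deliver the rest."""
    lowered = answer.lower()
    hedged = any(marker in lowered for marker in HEDGE_MARKERS)
    route = "verify" if (is_high_risk or hedged) else "deliver"
    return Routed(answer=answer, route=route)


print(route_answer("Paris is the capital of France.", is_high_risk=False).route)
print(route_answer("I'm not sure, but the dose may be 50 mg.", is_high_risk=True).route)
```

Phrase matching is a crude stand-in for real uncertainty estimation; the point is only to have an explicit routing decision that can be logged and audited.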
Risks & Boundaries
Limitations
Toxicity judgments rely on a GPT-4 evaluator and the Perspective API, which can introduce judge bias
Fairness tests focus on 17 identities across gender, race, and religion and may miss other demographic axes
When Not To Use
As a sole liability measure for safety; FFT is diagnostic and not a legal or fairness compliance test
To evaluate non-identity-based biases not covered in the seed sets
Failure Modes
Models that refuse often can score lower on measured factuality even when the answers they do give are correct
GPT-4 as context judge can under- or over-report toxicity due to its own biases
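
One way to make the refusal failure mode visible is to report accuracy both over all prompts and over answered prompts only, next to the refusal rate. The sketch below assumes each result is a `(refused, correct)` pair and that a refusal counts as not correct in the overall score; how FFT itself scores refusals is not stated here, and the sample records are invented.

```python
def factuality_report(records):
    """Summarize (refused, correct) pairs into refusal-aware accuracy figures."""
    records = list(records)
    total = len(records)
    answered = [r for r in records if not r[0]]
    n_correct = sum(1 for _refused, correct in answered if correct)
    return {
        "refusal_rate": (total - len(answered)) / total if total else 0.0,
        # Refusals count against the model in this figure (assumption).
        "accuracy_all": n_correct / total if total else 0.0,
        # Accuracy restricted to prompts the model actually answered.
        "accuracy_answered": n_correct / len(answered) if answered else 0.0,
    }


# Hypothetical results: two answered (one correct), two refused.
sample = [(False, True), (False, False), (True, False), (True, False)]
report = factuality_report(sample)
print(report)
```

A large gap between `accuracy_all` and `accuracy_answered` suggests the headline score is driven by refusals rather than by wrong answers.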

