Overview
Production Readiness
0.4
Novelty Score
0.55
Cost Impact Score
0.3
Citation Count
4
Why It Matters For Business
FFT shows that models can spread wrong facts, make biased decisions, or look safe when an utterance is judged out of context. Companies should test models for factual errors and context-aware toxicity before using them in products.
Summary TLDR
The authors release FFT, a 2,116-example benchmark that probes three harm dimensions of LLM outputs: factuality (misinformation and counterfacts), fairness (identity preference and credit/criminal/health decisions across 17 identities), and toxicity (utterance-level and context-level, using jailbreak prompts). They evaluate 9 models (GPT-4, GPT-3.5, Llama2 variants, Vicuna) and find substantial gaps: factuality and counterfact handling are weak, fairness varies by model and identity, and context-aware toxicity is worse than literal toxicity. They show that supervised fine-tuning (SFT) and RLHF help, and they provide templates and data in the FFT GitHub repository.
Problem Statement
Current LLM harmlessness tests focus mostly on toxic language. Real harm also comes from factual errors and biased decisions. We need a compact, practical benchmark that tests whether models refuse, correct, or safely handle misleading facts, identity-sensitive predictions, and context-sensitive toxicity.
Main Contribution
FFT benchmark: 2,116 curated queries covering factuality (misinformation + counterfacts), fairness (identity preference, credit/criminal/health), and toxicity (utterance and context).
Evaluation of 9 popular LLMs (closed and open) under the same prompts and metrics, with automatic scoring and human-style checks.
Analysis of how SFT, RLHF, and model scale affect harmlessness, plus practical recommendations for evaluation.
Key Findings
Factuality is weak, especially on counterfactual prompts.
Fairness varies by model; GPT-4 shows lower disparity than many open-source models.
Context-level toxicity is worse than literal utterance toxicity across models.
SFT and RLHF reduce harmful outputs and improve refusal/uncertainty behaviors.
Results
Accuracy
Fairness overall CV (lower is better)
Toxicity overall (non-toxicity = 1 - toxicity)
Who Should Care
What To Try In 7 Days
Run FFT or a subset on your model to surface counterfactual and context-toxicity failures
Add a refusal/uncertainty layer for factual claims and route high-risk answers to verification
Measure fairness with CV across sensitive groups on your application tasks and log disparities
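The CV measurement in the last step can be sketched as follows. This is a minimal illustration, not the FFT authors' implementation: the group names, the example rates, and the choice of population standard deviation are all assumptions, and the paper's exact CV definition may differ.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates_by_group):
    """Return std / mean of per-group rates (lower means less disparity).

    Assumption: population standard deviation is used. Raises
    statistics.StatisticsError if rates_by_group is empty.
    """
    values = list(rates_by_group.values())
    mu = mean(values)
    if mu == 0:
        # All rates are zero, so there is no disparity to report.
        return 0.0
    return pstdev(values) / mu


# Hypothetical positive-decision rates per identity group on one task.
approval_rates = {"group_a": 0.4, "group_b": 0.6}
cv = coefficient_of_variation(approval_rates)
print(f"CV across groups: {cv:.3f}")
```

Logging this value per task and per release makes it easier to spot when a model change widens the gap between groups.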
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Toxicity judgments rely on GPT-4 evaluator and Perspective-API, which can introduce judge bias
- Fairness tests focus on 17 identities across gender, race, and religion, and may miss other demographic axes
- Jailbreak templates expose unsafe behavior but may not cover real-world prompting variations
When Not To Use
- As a sole liability measure for safety; FFT is diagnostic and not a legal or fairness compliance test
- To evaluate non-identity-based biases not covered in the seed sets
- For training models; FFT is designed for evaluation only
Failure Modes
- Models that refuse often can score lower on measured factuality, even when the answers they do give are correct
- GPT-4 as context judge can under- or over-report toxicity due to its own biases
- Open-source models can be overly literal to inputs and thus appear worse under certain prompt styles
Core Entities
Models
- GPT-4
- GPT-3.5
- Llama2-chat-70B
- Llama2-chat-13B
- Llama2-chat-7B
- Vicuna-13B
- Vicuna-7B
- Llama2-13B
- Llama2-7B
Metrics
- Accuracy
- Coefficient of Variation (fairness)
- Perspective API toxicity (utterance)
- GPT-4 human-like evaluator (context toxicity)
- Non-toxicity = 1 - toxicity score
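The accuracy and non-toxicity metrics listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: toxicity scores are taken to be floats in [0, 1] already returned by some scorer, averaging over examples is an assumed aggregation, and the function names are hypothetical rather than taken from the FFT code.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def non_toxicity(toxicity_scores):
    """Return 1 minus the mean toxicity score (higher is better).

    Assumption: each score is in [0, 1] and comes from an external scorer.
    """
    if not toxicity_scores:
        raise ValueError("at least one score is required")
    return 1.0 - sum(toxicity_scores) / len(toxicity_scores)


# Hypothetical scores for three model responses.
print(non_toxicity([0.0, 0.5, 1.0]))
print(accuracy(["refuse", "answer"], ["refuse", "refuse"]))
```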
Datasets
- FFT (this work)
- Seed sources: Wikipedia, Reddit, public datasets for credit/crime/health
- RealToxicityPrompts (for toxicity seeds)
Benchmarks
- FFT

