FFT: a 2,116-instance benchmark that measures LLM factuality, fairness, and toxicity

November 30, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and actionable for safety audits, but uses GPT-4 as a judge and jailbreaks that can bias results; treat scores as diagnostic, not definitive.

Citations: 4

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 55%

Authors

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu

Why It Matters For Business

FFT shows that models can spread wrong facts, make biased decisions, or produce text that looks harmless in isolation but is toxic in context; companies must test models for factual errors and context-aware toxicity before using them in products.

Summary TLDR

The authors release FFT, a 2,116-example benchmark that probes three harm dimensions of LLM outputs: factuality (misinformation and counterfacts), fairness (identity preference and credit/criminal/health decisions across 17 identities), and toxicity (utterance-level and context-level, using jailbreak prompts). They evaluate 9 models (GPT-4, GPT-3.5, Llama2 variants, Vicuna) and find substantial gaps: factuality is weak, especially on counterfacts; fairness varies by model and identity; and context-level toxicity is worse than utterance-level toxicity. They report that supervised fine-tuning (SFT) and RLHF help, and they provide templates and data in the FFT GitHub repository.

Problem Statement

Current LLM harmlessness tests focus mostly on toxic language. Real harm also comes from factual errors and biased decisions. We need a compact, practical benchmark that tests whether models refuse, correct, or safely handle misleading facts, identity-sensitive predictions, and context-sensitive toxicity.

Main Contribution

FFT benchmark: 2,116 curated queries covering factuality (misinformation + counterfacts), fairness (identity preference, credit/criminal/health), and toxicity (utterance and context).

Evaluation of 9 popular LLMs (closed and open) under the same prompts and metrics, with automatic scoring plus a GPT-4 evaluator used as a human-like judge for context toxicity.

Key Findings

Factuality is weak, especially on counterfactual prompts.

Numbers: Table 4: GPT-4 overall factuality 0.54; counterfacts accuracy 0.254

Practical Use: Include adversarial counterfact tests in evaluations and treat model facts cautiously; add refusal/uncertainty signals and external verification for any factual claim.

Evidence Ref: Table 4
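To make the counterfact gap visible in your own evaluation, it helps to report accuracy per subset rather than a single overall number. The following is a minimal sketch, assuming each judged output is a record with a hypothetical `subset` label and a boolean `correct` field; these field names and the sample records are illustrative and are not FFT's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Return {subset: fraction of records judged correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["subset"]] += 1
        hits[rec["subset"]] += int(rec["correct"])
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical judged outputs (assumption: not real FFT data).
records = [
    {"subset": "true_false", "correct": True},
    {"subset": "true_false", "correct": True},
    {"subset": "counterfact", "correct": False},
    {"subset": "counterfact", "correct": True},
    {"subset": "counterfact", "correct": False},
    {"subset": "counterfact", "correct": False},
]

scores = accuracy_by_subset(records)
for name, value in sorted(scores.items()):
    print(f"{name}: {value:.2f}")
```

Reporting the subsets side by side keeps a weak counterfact score from being averaged away by easier question types.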

Fairness varies by model; GPT-4 shows lower disparity than many open-source models.

Numbers: Table 6: GPT-4 overall CV 0.483 (lower is better) vs Llama2-chat-70B 0.888

Practical Use: Measure coefficient of variation across sensitive groups in deployment tasks and prioritize RLHF/SFT steps if disparities exceed acceptable CV thresholds.

Evidence Ref: Table 6
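The coefficient of variation can be computed in a few lines. This sketch assumes the standard definition (standard deviation divided by mean) applied to a per-group outcome rate; the exact quantity the paper feeds into CV is not stated here, and the group names and rates below are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates):
    """CV = population standard deviation / mean of per-group rates."""
    values = list(rates.values())
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean is zero; CV is undefined")
    return pstdev(values) / mu


# Hypothetical positive-decision rates per identity group (not FFT data).
equal_rates = {"group_a": 0.50, "group_b": 0.50, "group_c": 0.50}
skewed_rates = {"group_a": 0.20, "group_b": 0.50, "group_c": 0.80}

cv_equal = coefficient_of_variation(equal_rates)
cv_skewed = coefficient_of_variation(skewed_rates)
print(f"equal: {cv_equal:.3f}, skewed: {cv_skewed:.3f}")
```

A CV of 0 means every group receives the same rate; larger values mean the model treats groups more unevenly. What counts as an acceptable threshold is a policy decision for your application.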

Results

Metric: Accuracy
Value: GPT-4 0.54; GPT-3.5 0.548; Llama2-chat-13B 0.55; Llama2-chat-70B 0.547
Baseline: not given
Delta: not given
Split / Dataset: FFT factuality (misinformation + counterfacts)
Evidence: Table 4 reports per-model accuracy across True-False, open-ended, and counterfacts
Evidence Ref: Table 4

Metric: Fairness overall CV (lower is better)
Value: GPT-4 0.483; GPT-3.5 0.468; Llama2-chat-70B 0.888; Llama2-chat-13B 0.767
Baseline: not given
Delta: not given
Split / Dataset: FFT fairness (gender, race, religion tasks)
Evidence: Table 6 lists coefficient of variation per identity group and overall
Evidence Ref: Table 6

What To Try In 7 Days

Run FFT or a subset on your model to surface counterfactual and context-toxicity failures

Add a refusal/uncertainty layer for factual claims and route high-risk answers to verification

Measure fairness with CV across sensitive groups on your application tasks and log disparities
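The refusal/uncertainty layer from the second item could start as a small routing function. Everything in this sketch is an assumption for illustration: the `Answer` fields, the confidence score (which would have to come from your own calibration step), the 0.8 threshold, and the three route names.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float       # hypothetical score in [0, 1]
    is_factual_claim: bool  # hypothetical flag set by an upstream classifier


def route(answer, threshold=0.8):
    """Decide what to do with a model answer before it reaches the user."""
    if not answer.is_factual_claim:
        return "deliver"   # nothing to verify
    if answer.confidence < threshold:
        return "abstain"   # surface uncertainty instead of a shaky fact
    return "verify"        # confident facts still go to external checking


print(route(Answer("Hello!", 0.30, False)))
print(route(Answer("The treaty was signed in 1850.", 0.55, True)))
print(route(Answer("Water boils at 100 C at sea level.", 0.95, True)))
```

Sending even high-confidence factual claims to verification follows the summary's advice to treat model facts cautiously; relax that branch only once you have evidence that the confidence score is well calibrated.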

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Toxicity judgments rely on a GPT-4 evaluator and the Perspective API, both of which can introduce judge bias

Fairness tests focus on 17 identities across gender, race, and religion, and may miss other demographic axes

When Not To Use

As a sole liability measure for safety; FFT is diagnostic and not a legal or fairness compliance test

To evaluate non-identity-based biases not covered in the seed sets

Failure Modes

Models that frequently refuse can show lower measured factuality even when correct

GPT-4 as context judge can under- or over-report toxicity due to its own biases
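The first failure mode is easier to spot when refusals are reported separately from wrong answers. A minimal sketch, assuming each response has already been labeled with one of three hypothetical outcomes ("correct", "wrong", "refused"); the labels and counts are illustrative, not FFT results.

```python
def factuality_report(outcomes):
    """Summarize outcomes so that refusals are not silently counted as errors."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    correct = outcomes.count("correct")
    refused = outcomes.count("refused")
    answered = total - refused
    return {
        # Refusals count against the model here.
        "strict_accuracy": correct / total,
        "refusal_rate": refused / total,
        # Accuracy over the questions the model actually answered.
        "answered_accuracy": correct / answered if answered else None,
    }


# Hypothetical labels: 4 correct, 1 wrong, 5 refused.
outcomes = ["correct"] * 4 + ["wrong"] + ["refused"] * 5
report = factuality_report(outcomes)
print(report)
```

A large gap between strict accuracy and answered-only accuracy suggests that a low factuality score is driven by refusals rather than by wrong facts.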

Core Entities

Models

GPT-4, GPT-3.5, Llama2-chat-70B, Llama2-chat-13B, Llama2-chat-7B, Vicuna-13B, Vicuna-7B, Llama2-13B, Llama2-7B

Metrics

Accuracy; Coefficient of Variation (fairness); Perspective API toxicity (utterance); GPT-4 human-like evaluator (context toxicity); Non-toxicity = 1 - toxicity score
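The non-toxicity metric is a simple transformation of a toxicity score. The sketch below assumes scores already lie in [0, 1] and that per-response values are averaged into one number; the averaging step and the sample scores are assumptions, and no scoring API is called.

```python
def non_toxicity(toxicity_scores):
    """Mean of (1 - toxicity) over responses; higher means less toxic."""
    if not toxicity_scores:
        raise ValueError("need at least one score")
    for score in toxicity_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return sum(1.0 - score for score in toxicity_scores) / len(toxicity_scores)


# Hypothetical toxicity scores for three responses (not FFT data).
sample_scores = [0.1, 0.3, 0.2]
result = non_toxicity(sample_scores)
print(f"non-toxicity: {result:.2f}")
```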

Datasets

FFT (this work); seed sources: Wikipedia, Reddit, public datasets for credit/crime/health; RealToxicityPrompts (for toxicity seeds)

Benchmarks

FFT