FFT: a 2,116-instance benchmark that measures LLM factuality, fairness, and toxicity

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and actionable for safety audits, but uses GPT-4 as a judge and jailbreaks that can bias results; treat scores as diagnostic, not definitive.

Citations: 4

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 55%

Authors

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FFT shows that models can state wrong facts, make biased decisions, or produce output that looks harmless in isolation but is toxic in context; companies must test models for factual errors and context-aware toxicity before using them in products.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Data Scientist

Summary TLDR

The authors release FFT, a 2,116-example benchmark that probes three harm dimensions of LLM outputs: factuality (misinformation and counterfacts), fairness (identity preference, credit/criminal/health decisions across 17 identities), and toxicity (utterance and context-level using jailbreak prompts). They evaluate 9 models (GPT-4, GPT-3.5, Llama2 variants, Vicuna) and find substantial gaps: factuality and counterfact handling are weak, fairness varies by model and identity, and context-aware toxicity is worse than literal toxicity. They show supervised fine-tuning (SFT) and RLHF help, and they provide templates and data at the FFT GitHub.

Problem Statement

Current LLM harmlessness tests focus mostly on toxic language. Real harm also comes from factual errors and biased decisions. We need a compact, practical benchmark that tests whether models refuse, correct, or safely handle misleading facts, identity-sensitive predictions, and context-sensitive toxicity.

Main Contribution

FFT benchmark: 2,116 curated queries covering factuality (misinformation + counterfacts), fairness (identity preference, credit/criminal/health), and toxicity (utterance and context).

Evaluation of 9 popular LLMs (closed and open) under the same prompts and metrics, combining automatic scoring with a GPT-4 evaluator used as a human-like judge for context-level toxicity.

Key Findings

Factuality is weak, especially on counterfactual prompts.

Numbers: Table 4: GPT-4 overall factuality 0.54; counterfacts accuracy 0.254

Practical Use: Include adversarial counterfact tests in evaluations and treat model facts cautiously; add refusal/uncertainty signals and external verification for any factual claim.

Evidence Ref: Table 4

Fairness varies by model; GPT-4 shows lower disparity than many open-source models.

Numbers: Table 6: GPT-4 overall CV 0.483 (lower is better) vs Llama2-chat-70B 0.888

Practical Use: Measure coefficient of variation across sensitive groups in deployment tasks and prioritize RLHF/SFT steps if disparities exceed acceptable CV thresholds.

Evidence Ref: Table 6
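As a rough illustration of the fairness metric, the sketch below computes a coefficient of variation (standard deviation divided by mean) over per-group rates. The function name, the group labels, and the example rates are hypothetical, and the use of the population standard deviation is an assumption; the paper's exact formula may differ.

```python
from statistics import mean, pstdev


def coefficient_of_variation(group_rates):
    """Return std / mean of per-group rates (population std, an assumption)."""
    values = list(group_rates.values())
    if len(values) < 2:
        raise ValueError("need at least two groups to measure disparity")
    avg = mean(values)
    if avg == 0:
        raise ValueError("mean rate is zero; CV is undefined")
    return pstdev(values) / avg


# Hypothetical favorable-decision rates per identity group (illustrative only,
# not taken from the paper).
example_rates = {"group_a": 0.2, "group_b": 0.4, "group_c": 0.6}
example_cv = coefficient_of_variation(example_rates)
```

A CV of 0 means every group received the same rate; larger values mean the model's decisions vary more across identities.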

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 0.54; GPT-3.5 0.548; Llama2-chat-13B 0.55; Llama2-chat-70B 0.547	—	—	FFT factuality (misinformation + counterfacts)	Table 4 reports per-model accuracy across True-False, open-ended, and counterfacts	Table 4
Fairness overall CV (lower better)	GPT-4 0.483; GPT-3.5 0.468; Llama2-chat-70B 0.888; Llama2-chat-13B 0.767	—	—	FFT fairness (gender, race, religion tasks)	Table 6 lists coefficient of variation per identity group and overall	Table 6

What To Try In 7 Days

Run FFT or a subset on your model to surface counterfactual and context-toxicity failures

Add a refusal/uncertainty layer for factual claims and route high-risk answers to verification

Measure fairness with CV across sensitive groups on your application tasks and log disparities
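The refusal/uncertainty layer from the list above could look something like the following minimal sketch. Everything here is hypothetical: the `Answer` fields, the `route` function, and the 0.7 threshold are placeholders for whatever confidence and risk signals your own pipeline produces, and none of it comes from the FFT code.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical score in [0, 1] from your own pipeline
    high_risk: bool    # hypothetical flag, e.g. a medical or legal claim


def route(answer, min_confidence=0.7):
    """Decide what to do with a factual answer before it reaches the user."""
    if answer.high_risk:
        # High-risk claims always go to external verification.
        return "verify"
    if answer.confidence < min_confidence:
        # Low-confidence claims are replaced by an uncertainty message.
        return "abstain"
    return "answer"


decisions = [
    route(Answer("claim about a drug dosage", 0.95, True)),
    route(Answer("obscure historical date", 0.40, False)),
    route(Answer("well-known capital city", 0.99, False)),
]
```

Checking the risk flag before the confidence score is a deliberate choice in this sketch: a confident but high-risk answer is still routed to verification.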

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/cuishiyao96/FFT

Data URLs

https://github.com/cuishiyao96/FFT

Risks & Boundaries

Limitations

Toxicity judgments rely on a GPT-4 evaluator and the Perspective API, which can introduce judge bias

Fairness tests focus on 17 identities across gender, race, and religion and may miss other demographic axes

When Not To Use

As a sole liability measure for safety; FFT is diagnostic and not a legal or fairness compliance test

To evaluate non-identity-based biases not covered in the seed sets

Failure Modes

Models that refuse frequently produce lower measured factuality even when correct

GPT-4 as context judge can under- or over-report toxicity due to its own biases
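To see how the refusal failure mode can distort a factuality score, one option is to report accuracy over all prompts alongside accuracy over answered prompts only. The sketch below assumes that a refusal is scored as incorrect, which may not match FFT's actual scoring; the function name and record format are hypothetical.

```python
def factuality_report(records):
    """Summarize accuracy with and without refusals.

    records: list of (is_refusal, is_correct) pairs, one per prompt.
    Assumption: a refusal counts as incorrect in the overall accuracy.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to score")
    answered = [is_correct for is_refusal, is_correct in records if not is_refusal]
    correct = sum(1 for is_correct in answered if is_correct)
    return {
        "accuracy_all": correct / total,
        "accuracy_answered": correct / len(answered) if answered else None,
        "refusal_rate": (total - len(answered)) / total,
    }


# Hypothetical results: two answered prompts (one correct) and two refusals.
sample = [(False, True), (False, False), (True, False), (True, False)]
report = factuality_report(sample)
```

A large gap between the two accuracy figures suggests the score is driven by refusals rather than by wrong answers.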

Core Entities

Models

GPT-4, GPT-3.5, Llama2-chat-70B, Llama2-chat-13B, Llama2-chat-7B, Vicuna-13B, Vicuna-7B, Llama2-13B, Llama2-7B

Metrics

Accuracy; Coefficient of Variation (fairness); Perspective API toxicity (utterance); GPT-4 human-like evaluator (context toxicity); Non-toxicity = 1 - toxicity score
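The non-toxicity metric listed above can be aggregated over a set of responses as in this small sketch. It assumes toxicity scores are already available as floats in [0, 1] (for example from a toxicity classifier) and that the benchmark-level figure is a simple mean; the function name is hypothetical.

```python
def mean_non_toxicity(toxicity_scores):
    """Average of (1 - toxicity) over a list of scores in [0, 1]."""
    if not toxicity_scores:
        raise ValueError("no scores provided")
    for score in toxicity_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"toxicity score out of range: {score}")
    return sum(1.0 - score for score in toxicity_scores) / len(toxicity_scores)


# Hypothetical per-response toxicity scores (illustrative only).
example_non_toxicity = mean_non_toxicity([0.0, 0.5, 1.0])
```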

Datasets

FFT (this work); seed sources: Wikipedia, Reddit, public datasets for credit/crime/health; RealToxicityPrompts (for toxicity seeds)

Benchmarks

FFT
