FFT: a 2,116-instance benchmark that measures LLM factuality, fairness, and toxicity

November 30, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.55

Cost Impact Score

0.3

Citation Count

4

Authors

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu

Why It Matters For Business

FFT shows that models can spread wrong facts, make biased decisions, or produce text that looks safe in isolation but is toxic in context. Companies should test models for factual errors and context-aware toxicity before using them in products.

Summary TLDR

The authors release FFT, a 2,116-example benchmark that probes three harm dimensions of LLM outputs: factuality (misinformation and counterfacts), fairness (identity preference, plus credit, criminal, and health decisions across 17 identities), and toxicity (utterance-level and context-level, using jailbreak prompts). They evaluate 9 models (GPT-4, GPT-3.5, Llama2 variants, Vicuna) and find substantial gaps: factuality is weak, especially on counterfacts; fairness varies by model and identity; and context-level toxicity is worse than literal utterance toxicity. They show that supervised fine-tuning (SFT) and RLHF help, and they provide templates and data in the FFT GitHub repository.

Problem Statement

Current LLM harmlessness tests focus mostly on toxic language. Real harm also comes from factual errors and biased decisions. We need a compact, practical benchmark that tests whether models refuse, correct, or safely handle misleading facts, identity-sensitive predictions, and context-sensitive toxicity.

Main Contribution

FFT benchmark: 2,116 curated queries covering factuality (misinformation + counterfacts), fairness (identity preference, credit/criminal/health), and toxicity (utterance and context).

Evaluation of 9 popular LLMs (closed and open) under the same prompts and metrics, with automatic scoring and human-style checks.

Analysis of how SFT, RLHF, and model scale affect harmlessness, plus practical recommendations for evaluation.

Key Findings

Factuality is weak, especially on counterfactual prompts.

Numbers (Table 4): GPT-4 overall factuality 0.54; counterfacts accuracy 0.254

Fairness varies by model; GPT-4 shows lower disparity than many open-source models.

Numbers (Table 6): GPT-4 overall CV 0.483 (lower is better) vs Llama2-chat-70B 0.888

Context-level toxicity is worse than literal utterance toxicity across models.

Numbers (Table 7): GPT-4 utterance 0.84 vs context 0.678 (non-toxicity, i.e. 1 - toxicity; higher is better)

SFT and RLHF reduce harmful outputs and improve refusal/uncertainty behaviors.

Numbers (Section 5.1 and Tables 7/4): RLHF-tuned chat models outperform non-RLHF models on toxicity and factuality refusal rates

Results

Accuracy

Value: GPT-4 0.54; GPT-3.5 0.548; Llama2-chat-13B 0.55; Llama2-chat-70B 0.547

Fairness overall CV (lower better)

Value: GPT-4 0.483; GPT-3.5 0.468; Llama2-chat-70B 0.888; Llama2-chat-13B 0.767

Toxicity overall (non-toxicity = 1 - toxicity)

Value: GPT-4 0.759; GPT-3.5 0.722; Llama2-chat-13B 0.807; Llama2-chat-7B 0.838

Who Should Care

What To Try In 7 Days

Run FFT or a subset on your model to surface counterfactual and context-toxicity failures

Add a refusal/uncertainty layer for factual claims and route high-risk answers to verification

Measure fairness with CV across sensitive groups on your application tasks and log disparities
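As a rough illustration of the CV step above, the sketch below computes a coefficient of variation (standard deviation divided by mean) over per-group outcome rates. The group names and rates are made up for the example, and the use of the population standard deviation is an assumption; the summary does not state which variant the FFT authors use.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates_by_group):
    """Return std / mean over per-group rates (lower means less disparity).

    Assumption: population standard deviation. The exact FFT
    definition is not given in this summary.
    """
    values = list(rates_by_group.values())
    if not values:
        raise ValueError("need at least one group")
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean rate is zero; CV is undefined")
    return pstdev(values) / mu


# Hypothetical positive-decision rates per identity group (not FFT data).
example_rates = {"group_a": 0.20, "group_b": 0.40}
cv = coefficient_of_variation(example_rates)
print(f"CV = {cv:.3f}")
```

Logging this value per task over time makes it easier to spot when a model or prompt change widens the gap between groups.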

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Toxicity judgments rely on GPT-4 evaluator and Perspective-API, which can introduce judge bias
  • Fairness tests focus on 17 identities across gender, race, and religion, and may miss other demographic axes
  • Jailbreak templates expose unsafe behavior but may not cover real-world prompting variations

When Not To Use

  • As a sole liability measure for safety; FFT is diagnostic and not a legal or fairness compliance test
  • To evaluate non-identity-based biases not covered in the seed sets
  • For training models; FFT is designed for evaluation only

Failure Modes

  • Models that refuse often can score lower on measured factuality, even in cases where they would answer correctly
  • GPT-4 as context judge can under- or over-report toxicity due to its own biases
  • Open-source models can be overly literal to inputs and thus appear worse under certain prompt styles

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • Llama2-chat-70B
  • Llama2-chat-13B
  • Llama2-chat-7B
  • Vicuna-13B
  • Vicuna-7B
  • Llama2-13B
  • Llama2-7B

Metrics

  • Accuracy
  • Coefficient of Variation (fairness)
  • Perspective API toxicity (utterance)
  • GPT-4 human-like evaluator (context toxicity)
  • Non-toxicity = 1 - toxicity score
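The non-toxicity conversion in the list above can be sketched in a few lines. The averaging helper is an assumption: a simple mean of the utterance and context scores reproduces the reported GPT-4 overall value (0.759 from 0.84 and 0.678), but this summary does not confirm that the authors aggregate this way.

```python
def non_toxicity(toxicity_score):
    """Convert a toxicity score in [0, 1] to non-toxicity (higher is better)."""
    if not 0.0 <= toxicity_score <= 1.0:
        raise ValueError("toxicity score must be in [0, 1]")
    return 1.0 - toxicity_score


def overall_non_toxicity(utterance, context):
    """Assumed aggregation: unweighted mean of the two non-toxicity scores."""
    return (utterance + context) / 2


# GPT-4 non-toxicity values quoted in this summary (Table 7).
gpt4_overall = overall_non_toxicity(0.84, 0.678)
print(round(gpt4_overall, 3))
```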

Datasets

  • FFT (this work)
  • seed sources: Wikipedia, Reddit, public datasets for credit/crime/health
  • RealToxicityPrompts (for toxicity seeds)

Benchmarks

  • FFT