FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes

Overview

Decision Snapshot: Ready For Pilot

The benchmark is a practical first step with clear metrics and datasets, but validation covers only ChatGPT and two domains; further cross-model and real-world tests are needed before deployment.

Citations: 17

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 50%

Novelty: 60%

Authors

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

Why It Matters For Business

If you use LLMs to generate recommendations, they can favor or disfavor user groups; auditing with a generative-aware fairness test prevents reputational and regulatory risk.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

The paper builds FaiRLLM, a benchmark and dataset to test user-side fairness of LLM-based recommendation (RecLLM). It defines fairness as whether recommendations without sensitive info match those when specific sensitive attributes are specified. The authors propose three list-similarity scores (Jaccard, SERP*, PRAG*), two fairness metrics (SNSR range, SNSV variance), and datasets for music and movies covering eight sensitive attributes. They run ChatGPT (greedy decoding) and find measurable unfairness that persists across list lengths, typos, and Chinese/English prompts. Code and data are released.

Problem Statement

Large language models can be used to generate recommendations, but existing fairness benchmarks assume fixed candidate sets or numeric scores. Those assumptions break for generative RecLLM. We need a new, practical way to measure whether an LLM favors or disfavors user groups when sensitive attributes are hidden.

Main Contribution

A new benchmark (FaiRLLM) tailored to generative recommendation fairness.

Three similarity measures (Jaccard, SERP*, PRAG*) and two fairness statistics (SNSR, SNSV) for RecLLM.
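As a rough illustration of how these statistics fit together, the sketch below computes Jaccard@K for one pair of lists and then summarizes a set of per-attribute-value similarities with SNSR and SNSV. The function names are illustrative and are not taken from the released code. SNSR is taken as max minus min, which matches the music religion figures under Key Findings (0.7057 - 0.6503 = 0.0554). SNSV is implemented as a population standard deviation; that is an assumption, made because a reported SNSV of 0.0248 alongside a range of 0.0554 is too large to be a raw variance (a variance cannot exceed (range/2)^2, about 0.0008 here). SERP* and PRAG* are omitted; consult the paper for their exact definitions.

```python
from statistics import pstdev


def jaccard_at_k(neutral, sensitive, k):
    """Set overlap between the top-k items of two recommendation lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    if not a and not b:
        return 1.0  # choice made here: two empty lists count as identical
    return len(a & b) / len(a | b)


def snsr(similarities):
    """Range of similarity-to-neutral across the values of one attribute."""
    return max(similarities) - min(similarities)


def snsv(similarities):
    """Spread of similarity-to-neutral across the values of one attribute.

    Assumption: population standard deviation (see the note above).
    """
    return pstdev(similarities)


# Only the max and min for music/religion are quoted in this summary,
# so SNSR can be checked but SNSV cannot be reproduced from them.
religion_extremes = [0.7057, 0.6503]
print(round(snsr(religion_extremes), 4))
```

In practice each similarity fed to `snsr` and `snsv` would itself be an average over many prompts (one per singer or director), with one averaged value per attribute value.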

Key Findings

ChatGPT shows measurable unfairness on movie recommendations when measured by pairwise ranking agreement (PRAG*@20).

Numbers: Movie PRAG*@20 SNSR up to 0.2191; SNSV up to 0.0828 (Table 1)

Practical Use: Audit LLM recommendation outputs. The similarity-to-neutral gap between attribute values can reach about 0.22, so vulnerable groups may receive noticeably different lists.

Evidence Ref: Table 1 (PRAG*@20, Movie)
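To give a feel for what a pairwise ranking agreement measures, here is a simplified stand-in. It is not the paper's PRAG* formula: the normalization (dividing by the number of item pairs in the sensitive list) and the function name `pairwise_agreement` are assumptions made for illustration. It counts how many ordered pairs in the attribute-conditioned list appear in the same relative order in the neutral list, and it assumes items within a list are unique.

```python
from itertools import combinations


def pairwise_agreement(neutral, sensitive, k):
    """Fraction of ordered item pairs in the sensitive top-k list that the
    neutral top-k list also contains, in the same relative order."""
    neutral_rank = {item: i for i, item in enumerate(neutral[:k])}
    # combinations() preserves list order, so each pair is (higher, lower).
    pairs = list(combinations(sensitive[:k], 2))
    if not pairs:
        return 0.0  # fewer than two items: no pair to compare
    agree = sum(
        1
        for hi, lo in pairs
        if hi in neutral_rank
        and lo in neutral_rank
        and neutral_rank[hi] < neutral_rank[lo]
    )
    return agree / len(pairs)


same = pairwise_agreement(["a", "b", "c"], ["a", "b", "c"], 3)
reversed_order = pairwise_agreement(["a", "b", "c"], ["c", "b", "a"], 3)
print(same, reversed_order)
```

Unlike a set-overlap score, this drops to zero when the same items come back in reversed order, which is why a rank-aware measure can surface differences that Jaccard misses.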

Music recommendations are more similar overall but still show attribute gaps.

Numbers: Music PRAG*@20 for religion: max similarity 0.7057, min 0.6503, SNSR = 0.0554, SNSV = 0.0248 (Table 1)

Practical Use: Even small similarity gaps (about 0.055 here) can signal systematic preference; treat small divergences as actionable in production audits.

Evidence Ref: Table 1 (PRAG*@20, Music)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PRAG*@20 SNSR (movie, race)	0.2191	—	—	Movie dataset	Table 1 shows PRAG*@20 SNSR for race up to 0.2191	Table 1
PRAG*@20 SNSV (movie, race)	0.0828	—	—	Movie dataset	Table 1 shows PRAG*@20 SNSV for race up to 0.0828	Table 1

What To Try In 7 Days

Run the FaiRLLM prompts on your LLM for key sensitive attributes and compute SNSR/SNSV.

Compare neutral vs attribute-injected outputs using PRAG*@K to find ranking-level differences.

Test robustness: run typos and non-English prompts to reveal hidden failure modes.
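The steps above could be wired together roughly as follows. This is a minimal sketch: `recommend` is a hypothetical callable wrapping whatever LLM you use, the prompt wording is illustrative rather than the benchmark's exact template, and Jaccard stands in for whichever similarity measure you choose (for example PRAG*@K from the paper). `toy_recommend` is a fake model, included only so the example runs offline.

```python
from typing import Callable, Dict, List


def jaccard_at_k(neutral: List[str], sensitive: List[str], k: int) -> float:
    a, b = set(neutral[:k]), set(sensitive[:k])
    return len(a & b) / len(a | b) if (a or b) else 1.0


def audit_attribute(
    recommend: Callable[[str], List[str]],
    subject: str,
    values: List[str],
    k: int,
) -> Dict[str, float]:
    """Compare the neutral list against one list per attribute value."""
    # Hypothetical prompt wording; swap in the benchmark's own prompts.
    neutral_list = recommend(f"I am a fan of {subject}. Please recommend {k} titles.")
    scores = {}
    for value in values:
        prompt = f"I am a {value} fan of {subject}. Please recommend {k} titles."
        scores[value] = jaccard_at_k(neutral_list, recommend(prompt), k)
    return scores


def toy_recommend(prompt: str) -> List[str]:
    """Fake model: returns a different list for one made-up group."""
    if "group-B" in prompt:
        return ["t1", "x1", "x2", "x3"]
    return ["t1", "t2", "t3", "t4"]


scores = audit_attribute(toy_recommend, "Some Artist", ["group-A", "group-B"], k=4)
gap = max(scores.values()) - min(scores.values())  # SNSR-style range
print(scores, gap)
```

To probe robustness, the same function can be called with misspelled attribute values or with prompts translated into another language, and the resulting gaps compared against the clean run.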

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jizhi-zhang/FaiRLLM

Data URLs

https://github.com/jizhi-zhang/FaiRLLM

Risks & Boundaries

Limitations

The evaluation covers only ChatGPT (greedy decoding); other LLMs may behave differently.

The datasets use famous singers and directors, which biases the evaluation toward popular items.

When Not To Use

When you need item-level, score-based fairness (the paper assumes generative outputs without scores).

For cold-start items that are unlikely to appear in generation outputs.

Failure Modes

Disadvantage aligns with social stereotypes (e.g., 'African' is disadvantaged under the continent attribute).

Typos can amplify disadvantage for groups similar to a vulnerable value.

Core Entities

Models

ChatGPT

Metrics

Jaccard@K

SERP*@K

PRAG*@K

SNSR (Sensitive-to-Neutral Similarity Range)

SNSV (Sensitive-to-Neutral Similarity Variance)

Datasets

FaiRLLM-music

FaiRLLM-movie

FaiRLLM

Benchmarks

FaiRLLM
