FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes

May 12, 2023 | 6 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is a practical first step with clear metrics and datasets, but validation covers only ChatGPT and two domains; further cross-model and real-world tests are needed before deployment.

Citations: 17

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 50%

Novelty: 60%

Authors

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs used to generate recommendations can favor or disfavor user groups; auditing with a fairness test built for generative output helps catch this before it becomes a reputational or regulatory problem.

Summary TLDR

The paper builds FaiRLLM, a benchmark and dataset to test user-side fairness of LLM-based recommendation (RecLLM). It defines fairness as whether recommendations without sensitive info match those when specific sensitive attributes are specified. The authors propose three list-similarity scores (Jaccard, SERP*, PRAG*), two fairness metrics (SNSR range, SNSV variance), and datasets for music and movies covering eight sensitive attributes. They run ChatGPT (greedy decoding) and find measurable unfairness that persists across list lengths, typos, and Chinese/English prompts. Code and data are released.
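The neutral-versus-sensitive comparison described above starts from pairs of prompts that differ only in the stated attribute. A minimal sketch follows, assuming a hypothetical prompt wording; `build_prompt`, `build_prompt_set`, and the template text are illustrative and are not the benchmark's actual instructions.

```python
from typing import Dict, Optional, Sequence


def build_prompt(anchor: str, k: int, item_type: str,
                 attribute_value: Optional[str] = None) -> str:
    """Build one recommendation prompt.

    With attribute_value=None the prompt is neutral; otherwise the
    sensitive attribute value is stated. The wording is an assumption,
    not the benchmark's template.
    """
    if attribute_value is None:
        who = "a fan"
    else:
        article = "an" if attribute_value[0].lower() in "aeiou" else "a"
        who = f"{article} {attribute_value} fan"
    return f"I am {who} of {anchor}. Please recommend {k} {item_type} titles."


def build_prompt_set(anchor: str, k: int, item_type: str,
                     values: Sequence[str]) -> Dict[str, str]:
    """Return the neutral prompt plus one sensitive prompt per attribute value."""
    prompts = {"neutral": build_prompt(anchor, k, item_type)}
    for value in values:
        prompts[value] = build_prompt(anchor, k, item_type, value)
    return prompts


prompts = build_prompt_set("Adele", 20, "song", ["African", "Christian"])
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping everything except the attribute value fixed means any difference between the returned lists can be attributed to that value rather than to prompt wording.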

Problem Statement

Large language models can be used to generate recommendations, but existing fairness benchmarks assume fixed candidate sets or numeric scores. Those assumptions break for generative RecLLM, which produces free-form lists with no scores attached. We need a new, practical way to measure whether an LLM's recommendations shift for or against user groups once a sensitive attribute is stated in the prompt.

Main Contribution

A new benchmark (FaiRLLM) tailored to generative recommendation fairness.

Three similarity measures (Jaccard, SERP*, PRAG*) and two fairness statistics (SNSR, SNSV) for RecLLM.
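To make the metric pipeline concrete, here is a hedged sketch of the simplest similarity (Jaccard@K) together with the two fairness statistics. The function names are illustrative. SNSR is taken as max minus min, matching its "range" name; SNSV is computed here as a population standard deviation, which is an assumption about the paper's exact normalisation. SERP* and PRAG* are rank-aware and are not reproduced here.

```python
from statistics import pstdev
from typing import Dict, Sequence


def jaccard_at_k(neutral: Sequence[str], sensitive: Sequence[str], k: int) -> float:
    """Set overlap between the top-k items of two recommendation lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def snsr(sim_by_value: Dict[str, float]) -> float:
    """Range of sensitive-to-neutral similarity across attribute values."""
    scores = list(sim_by_value.values())
    return max(scores) - min(scores)


def snsv(sim_by_value: Dict[str, float]) -> float:
    """Spread of similarity across attribute values.

    Assumption: population standard deviation; check the paper for the
    exact definition before comparing against published numbers.
    """
    return pstdev(list(sim_by_value.values()))


# Toy example: similarity of each attribute value's list to the neutral list.
neutral_list = ["a", "b", "c"]
lists_by_value = {"value_1": ["a", "b", "c"], "value_2": ["b", "c", "d"]}
sims = {v: jaccard_at_k(neutral_list, items, k=3) for v, items in lists_by_value.items()}
print(sims, snsr(sims), snsv(sims))
```

In a real audit each similarity would be averaged over many prompts per attribute value before SNSR and SNSV are taken.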

Key Findings

ChatGPT shows measurable unfairness on movie recommendations when measured by pairwise ranking agreement (PRAG*@20).

Numbers: Movie PRAG*@20 SNSR up to 0.2191; SNSV up to 0.0828 (Table 1)

Practical Use: Audit LLM recommendation outputs. The gap between the most- and least-similar attribute values reaches about 0.22 in similarity score (roughly 22 points), so vulnerable groups may receive noticeably different lists.

Evidence Ref: Table 1 (PRAG*@20, Movie)

Music recommendations are more similar overall but still show attribute gaps.

Numbers: Music PRAG*@20 for religion: max similarity 0.7057, min 0.6503, SNSR=0.0554, SNSV=0.0248 (Table 1)

Practical Use: Even small similarity gaps (~5.5 points) can signal systematic preference; treat small divergences as actionable in production audits.

Evidence Ref: Table 1 (PRAG*@20, Music)
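The music figures can be sanity-checked with a few lines of arithmetic. The sketch below confirms that the reported SNSR equals max minus min, and uses Popoviciu's inequality (population variance is at most a quarter of the squared range) to show that the reported SNSV is too large to be a raw variance, so it sits on a standard-deviation scale. Only the four numbers come from the finding; the variable names are illustrative.

```python
# Reported values for Music, PRAG*@20, religion (Table 1).
max_sim, min_sim = 0.7057, 0.6503
reported_snsr, reported_snsv = 0.0554, 0.0248

# SNSR is the range of similarity across attribute values.
gap = max_sim - min_sim

# Popoviciu's inequality: population variance <= (max - min)^2 / 4,
# so the standard deviation is at most half the range.
max_variance = gap ** 2 / 4
max_std = gap / 2

print(f"range = {gap:.4f}")
print(f"largest possible variance = {max_variance:.6f}")
print(f"largest possible std dev  = {max_std:.4f}")
print("SNSV fits a variance scale:", reported_snsv <= max_variance)
print("SNSV fits a std-dev scale: ", reported_snsv <= max_std)
```

When reproducing the benchmark on another model, compare SNSV values on the same scale as the published table.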

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
PRAG*@20 SNSR (movie, race) | 0.2191 | - | - | Movie dataset | Table 1 shows PRAG*@20 SNSR for race up to 0.2191 | Table 1
PRAG*@20 SNSV (movie, race) | 0.0828 | - | - | Movie dataset | Table 1 shows PRAG*@20 SNSV for race up to 0.0828 | Table 1

What To Try In 7 Days

Run the FaiRLLM prompts on your LLM for key sensitive attributes and compute SNSR/SNSV.

Compare neutral vs attribute-injected outputs using PRAG*@K to find ranking-level differences.

Test robustness: run typos and non-English prompts to reveal hidden failure modes.
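The three steps above can be wired into one small audit loop. This is a sketch under stated assumptions: `recommend` is a hypothetical callable wrapping whatever LLM you use, `pairwise_order_agreement` is a simplified rank-agreement score and not the paper's PRAG* normalisation, and the spread is a population standard deviation.

```python
from itertools import combinations
from statistics import pstdev
from typing import Callable, Dict, List


def pairwise_order_agreement(neutral: List[str], sensitive: List[str]) -> float:
    """Fraction of item pairs in `sensitive` that also occur in `neutral`
    in the same relative order. Simplified stand-in for a PRAG-style score;
    assumes each list has no duplicate items."""
    rank = {item: i for i, item in enumerate(neutral)}
    pairs = list(combinations(sensitive, 2))  # (a, b) with a ranked above b
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs
                if a in rank and b in rank and rank[a] < rank[b])
    return agree / len(pairs)


def audit(recommend: Callable[[str], List[str]], neutral_prompt: str,
          sensitive_prompts: Dict[str, str], k: int = 20) -> dict:
    """Compare each sensitive prompt's list with the neutral list."""
    neutral = recommend(neutral_prompt)[:k]
    sims = {value: pairwise_order_agreement(neutral, recommend(prompt)[:k])
            for value, prompt in sensitive_prompts.items()}
    scores = list(sims.values())
    return {"similarity": sims,
            "range": max(scores) - min(scores),   # SNSR-style gap
            "spread": pstdev(scores)}             # SNSV-style spread (assumed std dev)


# Stub model with canned outputs, standing in for a real LLM call.
def fake_recommend(prompt: str) -> List[str]:
    canned = {"neutral": ["A", "B", "C", "D"],
              "value_1": ["A", "B", "C", "D"],
              "value_2": ["B", "A", "X", "Y"]}
    return canned[prompt]


report = audit(fake_recommend, "neutral",
               {"value_1": "value_1", "value_2": "value_2"}, k=4)
print(report)
```

For the robustness step, the same `audit` call can be rerun with typo-perturbed or translated prompts and the resulting gaps compared with the clean run.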

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation only runs on ChatGPT (greedy decoding); other LLMs may behave differently.

Datasets use famous singers and directors, which biases the evaluation toward popular items.

When Not To Use

When you need item-level score-based fairness (paper assumes generative outputs without scores).

For cold-start items that are unlikely to appear in generation outputs.

Failure Modes

The disadvantaged groups align with social stereotypes (e.g., 'African' is the disadvantaged value for the continent attribute).

Typos can amplify disadvantage for groups similar to a vulnerable value.
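To probe the typo failure mode, one option is to generate misspelled variants of each attribute value and audit them alongside the clean value. The sketch below uses single-character deletion, which is an assumption; how the benchmark itself constructs typos is not described here, and `single_deletion_typos` is an illustrative name.

```python
from typing import List


def single_deletion_typos(value: str) -> List[str]:
    """Return every distinct string made by deleting one character from value.

    Order follows the position of the deleted character. Some variants may
    be real words (for example a shorter related term), so review the list
    before using it in prompts.
    """
    seen = set()
    variants = []
    for i in range(len(value)):
        typo = value[:i] + value[i + 1:]
        if typo and typo not in seen:
            seen.add(typo)
            variants.append(typo)
    return variants


for typo in single_deletion_typos("African"):
    print(typo)
```

Each variant can then replace the clean attribute value in a sensitive prompt, so its similarity to the neutral list can be compared with that of the correctly spelled value.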

Core Entities

Models

ChatGPT

Metrics

Jaccard@K, SERP*@K, PRAG*@K, SNSR (Sensitive-to-Neutral Similarity Range), SNSV (Sensitive-to-Neutral Similarity Variance)

Datasets

FaiRLLM-music, FaiRLLM-movie, FaiRLLM

Benchmarks

FaiRLLM