FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes

May 12, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.25

Citation Count

17

Authors

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs to generate recommendations, they can favor or disfavor user groups; auditing with a generative-aware fairness test prevents reputational and regulatory risk.

Summary TLDR

The paper builds FaiRLLM, a benchmark and dataset for testing the user-side fairness of LLM-based recommendation (RecLLM). It defines fairness by comparing the recommendations returned for a neutral prompt (no sensitive information) with those returned when a specific sensitive attribute value is stated. The authors propose three list-similarity scores (Jaccard, SERP*, PRAG*), two fairness metrics computed over those similarities (SNSR, the range, and SNSV, the variance), and datasets for music and movies covering eight sensitive attributes. They evaluate ChatGPT with greedy decoding and find measurable unfairness that persists across list lengths, typos, and Chinese/English prompts. Code and data are released.

Problem Statement

Large language models can be used to generate recommendations, but existing fairness benchmarks assume fixed candidate sets or numeric scores. Those assumptions break for generative RecLLM. We need a new, practical way to measure whether an LLM favors or disfavors user groups when sensitive attributes are hidden.

Main Contribution

A new benchmark (FaiRLLM) tailored to generative recommendation fairness.

Three similarity measures (Jaccard, SERP*, PRAG*) and two fairness statistics (SNSR, SNSV) for RecLLM.

Two evaluation datasets (music and movies) covering eight user-side sensitive attributes and an audit of ChatGPT that exposes uneven behavior.
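The simplest of these measures can be sketched in a few lines. This is a minimal illustration, assuming Python and a single prompt per attribute value (the benchmark aggregates over many prompts). The function names and toy lists are hypothetical. SNSV is computed here as a population standard deviation, which is an assumption about the scale the benchmark reports; swap in `statistics.pvariance` if a true variance is wanted.

```python
from statistics import pstdev


def jaccard_at_k(neutral, sensitive, k):
    """Jaccard overlap between the top-k items of two recommendation lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    if not a and not b:
        return 1.0  # assumption: two empty lists count as identical
    return len(a & b) / len(a | b)


def snsr(similarities):
    """Range: gap between the most and least similar attribute value."""
    values = list(similarities.values())
    return max(values) - min(values)


def snsv(similarities):
    """Spread across attribute values (population standard deviation, an assumption)."""
    return pstdev(list(similarities.values()))


# Toy data: one neutral list and one list per (hypothetical) attribute value.
neutral = ["A", "B", "C", "D"]
by_value = {
    "value_x": ["A", "B", "C", "E"],
    "value_y": ["A", "F", "G", "H"],
}

sims = {v: jaccard_at_k(neutral, recs, k=4) for v, recs in by_value.items()}
print(sims)
print("SNSR:", snsr(sims), "SNSV:", snsv(sims))
```

A larger SNSR or SNSV means the model's output drifts away from the neutral list by different amounts for different groups, which is the signal the benchmark treats as unfairness.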

Key Findings

ChatGPT shows measurable unfairness on movie recommendations when measured by pairwise ranking agreement (PRAG*@20).

Numbers: Movie PRAG*@20 SNSR up to 0.2191; SNSV up to 0.0828 (Table 1)

Music recommendations are more similar overall but still show attribute gaps.

Numbers: Music PRAG*@20 for religion: max similarity 0.7057, min 0.6503, SNSR = 0.0554, SNSV = 0.0248 (Table 1)

Unfairness persists under prompt noise and language change.
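The ranking-level comparison behind the PRAG* numbers above can be approximated with a simplified pairwise check. The sketch below is not the paper's PRAG* formula; it is a hypothetical stand-in that counts how many item pairs in the attribute-specific list keep the same relative order in the neutral list. It assumes each list contains unique items.

```python
from itertools import combinations


def pairwise_order_agreement(neutral, sensitive, k):
    """Fraction of ordered item pairs from the sensitive top-k that the neutral
    top-k ranks in the same order. A pair with an item missing from the neutral
    list counts as a disagreement."""
    rank = {item: i for i, item in enumerate(neutral[:k])}
    pairs = list(combinations(sensitive[:k], 2))  # (higher-ranked, lower-ranked)
    if not pairs:
        return 0.0  # fewer than two items: nothing to compare
    agree = sum(
        1
        for first, second in pairs
        if first in rank and second in rank and rank[first] < rank[second]
    )
    return agree / len(pairs)


neutral = ["A", "B", "C", "D"]
sensitive = ["A", "C", "B", "E"]
score = pairwise_order_agreement(neutral, sensitive, k=4)
print(score)
```

Unlike a set-overlap score, this kind of measure drops when the same items appear in a different order, which is why it can surface differences that Jaccard misses.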

Results

PRAG*@20 SNSR (movie, race)

Value: 0.2191

PRAG*@20 SNSV (movie, race)

Value: 0.0828

PRAG*@20 similarity range (music, religion)

Value: max 0.7057 / min 0.6503
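As a quick sanity check on how to read these numbers, the short calculation below recomputes the music/religion SNSR from the reported max and min, and then checks what scale the reported SNSV sits on. It relies only on figures quoted in this summary plus one general fact: for values confined to a range r, a population variance cannot exceed (r/2)^2. The variable names are illustrative.

```python
# Figures quoted in this summary for music, religion, PRAG*@20.
sim_max, sim_min = 0.7057, 0.6503
reported_snsv = 0.0248

# SNSR is the range of similarities across attribute values.
snsr = round(sim_max - sim_min, 4)
print(snsr)  # 0.0554

# Upper bound on a population variance for values spread over this range.
variance_ceiling = (snsr / 2) ** 2
print(reported_snsv > variance_ceiling)  # True
print(reported_snsv <= snsr / 2)         # True
```

The reported SNSV is larger than any variance these values could have, yet fits under the bound for a standard deviation, so it is probably best read as a standard-deviation-scale quantity despite the name.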

Who Should Care

What To Try In 7 Days

Run the FaiRLLM prompts on your LLM for key sensitive attributes and compute SNSR/SNSV.

Compare neutral vs attribute-injected outputs using PRAG*@K to find ranking-level differences.

Test robustness: run typos and non-English prompts to reveal hidden failure modes.
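The three steps above could start from a small prompt builder like the one below. This is a sketch under stated assumptions: the template wording, the `build_prompt` and `add_typo` helpers, the placeholder anchor, and the example attribute values are all hypothetical, and the benchmark's released prompts should be preferred where available. Sending the prompts to a model is left out on purpose.

```python
import random


def build_prompt(anchor, k, item_type, attribute=None):
    """Return a neutral prompt, or an attribute-injected one if `attribute` is given.
    The wording is a hypothetical template, not the benchmark's exact text."""
    if attribute:
        article = "an" if attribute[0].lower() in "aeiou" else "a"
        who = f"{article} {attribute} fan"
    else:
        who = "a fan"
    return (
        f"I am {who} of {anchor}. Please list {k} {item_type} titles "
        "I might like, in order of preference."
    )


def add_typo(word, rng):
    """Drop one interior character to mimic a misspelled attribute value."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1:]


rng = random.Random(0)          # seeded so the perturbation is repeatable
anchor = "<singer name>"        # placeholder, fill in from your own data
values = ["African", "American"]  # illustrative attribute values

neutral_prompt = build_prompt(anchor, 20, "song")
sensitive_prompts = {v: build_prompt(anchor, 20, "song", v) for v in values}
typo_prompts = {v: build_prompt(anchor, 20, "song", add_typo(v, rng)) for v in values}

print(neutral_prompt)
for value, prompt in sensitive_prompts.items():
    print(value, "->", prompt)
```

Each attribute-injected or typo-perturbed output is then compared against the output for `neutral_prompt` with a list-similarity score, and SNSR/SNSV are computed across the attribute values.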

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation only runs on ChatGPT (greedy decoding); other LLMs may behave differently.
  • Datasets use famous singers/directors, which biases the evaluation toward popular items.
  • Fairness definition focuses on similarity to a neutral prompt; it does not measure downstream user utility or long-term effects.

When Not To Use

  • When you need item-level score-based fairness (paper assumes generative outputs without scores).
  • For cold-start items that are unlikely to appear in generation outputs.
  • If your production LLM uses prompt templates that differ from the benchmark's.

Failure Modes

  • Disadvantage aligns with social stereotypes (e.g., for the continent attribute, 'African' is the disadvantaged value).
  • Typos can amplify disadvantage for groups similar to a vulnerable value.
  • Language mixing (e.g., Chinese prompts on English-heavy data) reduces similarity and may distort comparisons.

Core Entities

Models

  • ChatGPT

Metrics

  • Jaccard@K
  • SERP*@K
  • PRAG*@K
  • SNSR (Sensitive-to-Neutral Similarity Range)
  • SNSV (Sensitive-to-Neutral Similarity Variance)

Datasets

  • FaiRLLM-music
  • FaiRLLM-movie
  • FaiRLLM

Benchmarks

  • FaiRLLM