Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.25
Citation Count
17
Why It Matters For Business
If you use LLMs to generate recommendations, they can systematically favor or disfavor user groups; auditing with a generative-aware fairness test helps reduce reputational and regulatory risk.
Summary TLDR
The paper builds FaiRLLM, a benchmark and dataset to test user-side fairness of LLM-based recommendation (RecLLM). It defines fairness by how closely recommendations from a neutral prompt (no sensitive information) match those from prompts that state a specific sensitive attribute value. The authors propose three list-similarity scores (Jaccard, SERP*, PRAG*), two fairness metrics (SNSR, the range, and SNSV, the variance, of that similarity across attribute values), and datasets for music and movies covering eight sensitive attributes. They run ChatGPT (greedy decoding) and find measurable unfairness that persists across list lengths, typos, and Chinese/English prompts. Code and data are released.
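The two fairness statistics can be illustrated with a minimal Python sketch. It assumes you already have one neutral-vs-sensitive similarity score per attribute value; the group names and numbers below are hypothetical, not results from the paper. SNSV is computed here as a population variance, and the paper's exact normalization (for example, taking a square root) may differ.

```python
from statistics import pvariance


def snsr(sim_by_group):
    """Range: gap between the most and least favored attribute value."""
    values = list(sim_by_group.values())
    return max(values) - min(values)


def snsv(sim_by_group):
    """Spread: population variance of similarity across attribute values.

    Assumption: plain population variance; check the paper for its exact form.
    """
    return pvariance(list(sim_by_group.values()))


# Hypothetical per-group similarity to the neutral recommendation list.
sims = {"group_a": 0.50, "group_b": 0.40, "group_c": 0.30}

range_gap = snsr(sims)  # 0.50 - 0.30
spread = snsv(sims)     # mean is 0.40, so (0.01 + 0 + 0.01) / 3
print(f"SNSR={range_gap:.3f} SNSV={spread:.4f}")
```

Both statistics are zero when every attribute value receives the same similarity to the neutral list, and they grow as groups diverge.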
Problem Statement
Large language models can be used to generate recommendations, but existing fairness benchmarks assume fixed candidate sets or numeric scores. Those assumptions break for generative RecLLM. We need a new, practical way to measure whether an LLM favors or disfavors user groups when a sensitive attribute is stated in the prompt, compared with a neutral prompt that omits it.
Main Contribution
A new benchmark (FaiRLLM) tailored to generative recommendation fairness.
Three similarity measures (Jaccard, SERP*, PRAG*) and two fairness statistics (SNSR, SNSV) for RecLLM.
Two evaluation datasets (music and movies) covering eight user-side sensitive attributes and an audit of ChatGPT that exposes uneven behavior.
Key Findings
ChatGPT shows measurable unfairness on movie recommendations when measured by pairwise ranking agreement (PRAG*@20).
Music recommendations are more similar overall but still show attribute gaps.
Unfairness persists under prompt noise and language change.
Results
PRAG*@20 SNSR (movie, race)
PRAG*@20 SNSV (movie, race)
PRAG*@20 similarity range (music, religion)
Who Should Care
What To Try In 7 Days
Run the FaiRLLM prompts on your LLM for key sensitive attributes and compute SNSR/SNSV.
Compare neutral vs attribute-injected outputs using PRAG*@K to find ranking-level differences.
Test robustness: run typos and non-English prompts to reveal hidden failure modes.
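The steps above can be sketched as a small audit loop in Python. Everything named here is an assumption for illustration: the prompt wording is hypothetical (not the benchmark's exact templates), `fake_recommend` is a stub standing in for your own LLM call, and plain Jaccard overlap stands in for whichever similarity score you choose.

```python
# Hypothetical prompt templates; swap in the released FaiRLLM prompts for a real audit.
NEUTRAL = "I am a fan of {name}. Please give me a list of {k} {item} titles."
SENSITIVE = "I am a {value} fan of {name}. Please give me a list of {k} {item} titles."


def jaccard(a, b):
    """Set overlap of two lists; two empty lists count as identical."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def audit(recommend, names, values, k=20, item="movie"):
    """Mean neutral-vs-sensitive similarity for each attribute value.

    `recommend` is any callable mapping a prompt string to a ranked title list.
    `names` must be non-empty.
    """
    totals = {v: 0.0 for v in values}
    for name in names:
        neutral = recommend(NEUTRAL.format(name=name, k=k, item=item))[:k]
        for v in values:
            prompt = SENSITIVE.format(value=v, name=name, k=k, item=item)
            totals[v] += jaccard(neutral, recommend(prompt)[:k])
    return {v: total / len(names) for v, total in totals.items()}


def fake_recommend(prompt):
    """Stub LLM: returns different titles when the prompt mentions 'group_b'."""
    base = [f"title_{i}" for i in range(5)]
    if "group_b" in prompt:
        return base[:3] + ["other_1", "other_2"]
    return base


sims = audit(fake_recommend, names=["director_x"], values=["group_a", "group_b"], k=5)
gap = max(sims.values()) - min(sims.values())  # range across attribute values
print(sims, gap)
```

For the robustness check, the same loop can be rerun with the attribute value misspelled or with translated templates, and the resulting gaps compared against the clean run.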
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation only runs on ChatGPT (greedy decoding); other LLMs may behave differently.
- Datasets use famous singers and directors, which biases the evaluation toward popular items.
- Fairness definition focuses on similarity to a neutral prompt; it does not measure downstream user utility or long-term effects.
When Not To Use
- When you need item-level score-based fairness (paper assumes generative outputs without scores).
- For cold-start items that are unlikely to appear in generation outputs.
- If your production LLM uses prompt templates that differ from the benchmark's.
Failure Modes
- Disadvantage aligns with social stereotypes (e.g., 'African' is the disadvantaged value for the continent attribute).
- Typos can amplify disadvantage for groups similar to a vulnerable value.
- Language mixing (e.g., Chinese prompts on English-heavy data) reduces similarity and may distort comparisons.
Core Entities
Models
- ChatGPT
Metrics
- Jaccard@K
- SERP*@K
- PRAG*@K
- SNSR (Sensitive-to-Neutral Similarity Range)
- SNSV (Sensitive-to-Neutral Similarity Variance)
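To make the difference between the three list-similarity scores concrete, here is a rough Python sketch. Only the Jaccard function is the standard definition; `rank_weighted_overlap` and `pairwise_agreement` are simplified stand-ins written in the spirit of SERP* and PRAG*, not the paper's exact formulas. The function names and the example lists are hypothetical, and lists are assumed to contain no duplicate items.

```python
def jaccard_at_k(neutral, sensitive, k):
    """Order-agnostic overlap of the two top-k lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_weighted_overlap(neutral, sensitive, k):
    """SERP-style: shared items count more near the top of the sensitive list."""
    neutral_set = set(neutral[:k])
    # The top position has weight k, the last position has weight 1.
    score = sum(k - rank for rank, item in enumerate(sensitive[:k])
                if item in neutral_set)
    return score / (k * (k + 1) / 2)


def pairwise_agreement(neutral, sensitive, k):
    """PRAG-style: share of ordered pairs whose order the neutral list preserves."""
    pos = {item: rank for rank, item in enumerate(neutral[:k])}
    top = sensitive[:k]
    pairs = agree = 0
    for i in range(len(top)):
        for j in range(i + 1, len(top)):
            pairs += 1
            first, second = top[i], top[j]
            # The earlier item must appear in the neutral list and stay ahead;
            # a later item missing from the neutral list is ranked at infinity.
            if first in pos and pos[first] < pos.get(second, float("inf")):
                agree += 1
    return agree / pairs if pairs else 0.0


neutral = ["a", "b", "c", "d"]
sensitive = ["a", "c", "b", "e"]
j = jaccard_at_k(neutral, sensitive, 4)           # 3 shared of 5 distinct -> 0.6
s = rank_weighted_overlap(neutral, sensitive, 4)  # (4 + 3 + 2) / 10 -> 0.9
p = pairwise_agreement(neutral, sensitive, 4)     # 5 of 6 pairs agree
print(j, s, p)
```

The example shows why a ranking-aware score can matter: swapping "b" and "c" leaves the Jaccard score unchanged but lowers the pairwise agreement.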
Datasets
- FaiRLLM-music
- FaiRLLM-movie
- FaiRLLM
Benchmarks
- FaiRLLM

