FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes
If you use LLMs to generate recommendations, they can favor or disfavor user groups; auditing with a fairness test designed for generative output helps reduce reputational and regulatory risk.
Key finding
ChatGPT shows measurable unfairness on movie recommendations when measured by the Pairwise Ranking Accuracy Gap over the top 20 items (PRAG*@20).
Numbers (movies, PRAG*@20): SNSR (Sensitive-to-Neutral Similarity Range) up to 0.2191; SNSV (Sensitive-to-Neutral Similarity Variance) up to 0.0828 (Table 1)
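To make SNSR and SNSV concrete, here is a minimal Python sketch of how the two aggregates could be computed. It rests on several assumptions: the similarity between a sensitive-attribute list and the neutral list is Jaccard overlap (swapped in for PRAG* because it is simpler to show), SNSR is taken as the max-minus-min of the per-value average similarities, and SNSV as their population standard deviation. The function names and the data layout are hypothetical, not the benchmark's own code.

```python
from statistics import pstdev


def jaccard(a, b):
    """Set overlap between two recommendation lists (order is ignored)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def snsr_snsv(neutral_lists, sensitive_lists):
    """Aggregate unfairness scores for one sensitive attribute.

    neutral_lists:   one top-K list per prompt, with no attribute mentioned.
    sensitive_lists: dict mapping each attribute value (hypothetical keys)
                     to its top-K lists, aligned prompt-by-prompt with
                     neutral_lists.
    Returns (snsr, snsv, per_value_similarity).
    """
    sims = {}
    for value, lists in sensitive_lists.items():
        per_prompt = [jaccard(n, s) for n, s in zip(neutral_lists, lists)]
        sims[value] = sum(per_prompt) / len(per_prompt)

    scores = list(sims.values())
    snsr = max(scores) - min(scores)  # assumed: range across attribute values
    snsv = pstdev(scores)             # assumed: population std. deviation
    return snsr, snsv, sims


# Toy data: one prompt, two attribute values (placeholders, not real results).
neutral = [["A", "B", "C", "D"]]
sensitive = {
    "group_1": [["A", "B", "C", "D"]],  # identical to the neutral list
    "group_2": [["A", "B", "X", "Y"]],  # shares only half of its items
}
snsr, snsv, sims = snsr_snsv(neutral, sensitive)
print(round(snsr, 4), round(snsv, 4))
```

A larger SNSR means the best-served and worst-served groups sit further apart in how closely their recommendations track the neutral ones; SNSV summarizes the spread across all values of the attribute.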

