Fairness Metrics Papers — Parsed & Scored for Practitioners

FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes

0.50

0.60

0.25

17

If you use LLMs to generate recommendations, they can favor or disfavor user groups; auditing them with a fairness test designed for generative output reduces reputational and regulatory risk.

Key finding

ChatGPT's movie recommendations differ measurably across user groups under the Pairwise Ranking Accuracy Gap metric (PRAG*@20).

Numbers: Movie PRAG*@20: SNSR (Sensitive-to-Neutral Similarity Range) up to 0.2191; SNSV (Sensitive-to-Neutral Similarity Variance) up to 0.0828 (Table 1)
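The two summary statistics above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code. It assumes SNSR is the max-minus-min spread, and SNSV the population standard deviation, of each group's similarity to the recommendations for a neutral prompt; check the paper for the exact definitions. It also swaps in plain Jaccard similarity for PRAG* to stay short. The function names and the sample lists are hypothetical.

```python
from statistics import pstdev

def jaccard(a, b):
    """Jaccard similarity between two recommendation lists (order ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def snsr_snsv(neutral, by_group, sim=jaccard):
    """Return (range, spread) of each group's similarity to the neutral list.

    Assumption: range = max - min, spread = population standard deviation.
    """
    sims = [sim(neutral, recs) for recs in by_group.values()]
    return max(sims) - min(sims), pstdev(sims)

# Hypothetical top-4 lists, not data from the paper.
neutral = ["A", "B", "C", "D"]
by_group = {
    "group_1": ["A", "B", "C", "E"],  # 3 shared of 5 distinct titles
    "group_2": ["A", "F", "G", "H"],  # 1 shared of 7 distinct titles
}
snsr, snsv = snsr_snsv(neutral, by_group)
```

Both values are 0 when every group receives recommendations equally similar to the neutral list; larger values mean the sensitive attribute changes the output more for some groups than for others.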

FairPy: Open toolkit to measure and reduce token-level bias in common language models

0.60

0.40

0.50

6

FairPy makes bias audits repeatable and faster across multiple metrics and models. Mitigation effects depend on the metric, however, so teams must validate fixes with several tests before deployment.

Key finding

FairPy collects common bias metrics and mitigation methods into one toolkit.

Make LLM recommenders auditable: proposer LLM + deterministic verifier + repair

0.80

0.60

0.30

0

PCN-Rec turns persuasive LLM outputs into auditable, machine-checked recommendations so platforms can guarantee and log policy compliance without trusting LLM explanations.

Key finding

On MovieLens-100K with a window of W=80, 551 of 943 users had at least one compliant slate inside the window.

Numbers: 551 / 943 users, about 58.4% (W=80)
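The proposer, verifier, and repair roles can be sketched as follows. This is a hypothetical illustration of the general pattern, not PCN-Rec's implementation: the function names, the single-swap repair strategy, and the example policy ("at least one tail item") are all assumptions, and the LLM proposer is stubbed out as a fixed list.

```python
def verify(slate, window, is_compliant):
    """Deterministic check: all items come from the window and the policy holds."""
    return set(slate) <= set(window) and is_compliant(slate)

def repair(slate, window, is_compliant):
    """Try single-item swaps from the window; return a compliant slate or None."""
    for i in range(len(slate)):
        for candidate in window:
            if candidate in slate:
                continue
            trial = slate[:i] + [candidate] + slate[i + 1:]
            if verify(trial, window, is_compliant):
                return trial
    return None

def recommend(proposed, ranked, w, is_compliant):
    """Accept, repair, or reject a proposed slate against the top-w window."""
    window = ranked[:w]
    if verify(proposed, window, is_compliant):
        return proposed, "accepted"
    fixed = repair(proposed, window, is_compliant)
    if fixed is not None:
        return fixed, "repaired"
    return None, "no compliant slate found"

# Hypothetical policy: a slate must contain at least one "tail" item.
TAIL = {"t1"}
is_compliant = lambda slate: any(item in TAIL for item in slate)

ranked = ["h1", "h2", "h3", "t1", "h4"]  # candidate ranking, best first
proposed = ["h1", "h2"]                  # stand-in for the LLM's proposal

result_w4 = recommend(proposed, ranked, 4, is_compliant)  # t1 is inside the window
result_w3 = recommend(proposed, ranked, 3, is_compliant)  # t1 falls outside it
```

With w=4 the verifier rejects the proposal and the repair step swaps in t1; with w=3 no compliant slate exists inside the window, which is the situation the remaining users in the reported count would be in. The returned status string is what a platform would log for the audit trail.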

Treat fairness as an emergent property in multi-agent systems; a framework and simulation show demographic parity narrows group reward gaps

0.40

0.60

0.50

0

Decentralized agent systems can amplify bias and create unfair outcomes; adding fairness checks and incentives early reduces legal, reputational, and customer risk, with no obvious short-term performance loss in toy tests.

Key finding

Applying a demographic-parity fairness adjustment narrowed the final group reward gap from 45 points to 5 points in the 10-agent simulation.

Numbers: With fairness: Group A 375 vs Group B 370; Without: 390 vs 345 (gap 5 vs 45).
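The gap arithmetic behind these numbers is simple enough to check directly. The snippet below is only a worked calculation on the reported group rewards; the helper name is hypothetical and nothing here reproduces the simulation itself.

```python
def group_gap(reward_a, reward_b):
    """Absolute difference between two groups' final rewards."""
    return abs(reward_a - reward_b)

# Final group rewards as reported for the 10-agent simulation.
without_fairness = {"A": 390, "B": 345}
with_fairness = {"A": 375, "B": 370}

gap_without = group_gap(without_fairness["A"], without_fairness["B"])  # 45
gap_with = group_gap(with_fairness["A"], with_fairness["B"])           # 5

# Combined reward across both groups, to see what the adjustment costs overall.
total_without = sum(without_fairness.values())  # 735
total_with = sum(with_fairness.values())        # 745
```

On these figures the adjustment moves 15 points away from Group A and 25 points toward Group B, so the combined reward does not fall in this run.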

Pool auditors' queries to cut fairness-estimate error — but avoid heavy pre-coordination when many agents join.

0.60

0.50

0.60

0

Pooling audit queries across teams cuts the number of queries needed to detect bias and improves fairness estimates; avoid heavy pre-coordination when many teams audit the same platform because it can increase error.

Key finding

Collaboration reduces average DP estimation variance versus independent audits.

Numbers: Empirical DP error reduced by 17.4%–24.6% (Fig.3, Section 6.1).
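To make the quantity being estimated concrete, here is a minimal sketch of a demographic parity (DP) gap computed from audit queries, first per auditor and then on the pooled samples. It assumes the simplest form of collaboration, concatenating everyone's query results; the paper's collaboration strategies may differ. The function names and sample data are hypothetical.

```python
def positive_rate(samples, group):
    """Share of positive outcomes among the queries for one group."""
    outcomes = [y for g, y in samples if g == group]
    return sum(outcomes) / len(outcomes)

def dp_gap(samples, a="A", b="B"):
    """Demographic parity gap: absolute difference in positive rates."""
    return abs(positive_rate(samples, a) - positive_rate(samples, b))

# Hypothetical (group, outcome) query results from two independent auditors.
auditor_1 = [("A", 1), ("A", 1), ("B", 0), ("B", 1)]
auditor_2 = [("A", 0), ("A", 1), ("B", 1), ("B", 0)]

gap_1 = dp_gap(auditor_1)                   # 0.5 from 4 queries
gap_2 = dp_gap(auditor_2)                   # 0.0 from 4 queries
gap_pooled = dp_gap(auditor_1 + auditor_2)  # 0.25 from 8 queries
```

Each small audit gives a noisy estimate (0.5 and 0.0 here), while the pooled estimate rests on twice as many queries per group, which is the intuition behind the lower estimation error.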

Task vectors can tune fairness: scale them to trade accuracy for group parity

0.60

0.45

0.65

0

Task-vector editing is a low-cost way to tune subgroup parity without full retraining; it can serve as an operational knob to reduce worst-case demographic gaps while keeping accuracy close to that of existing adaptation methods.

Key finding

Uniformly scaled task-vector merges can reduce group disparities while keeping accuracy close to full fine-tuning (FFT) and LoRA.

Numbers: Civil Comments (DistilBERT): Task Addition accuracy ≈0.9395; worst-DPD 0.0454; worst-EOD 0.3358 (Table 2)
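For readers new to task vectors, the basic arithmetic is: a task vector is the fine-tuned weights minus the pretrained weights, and a merge adds a scaled sum of task vectors back onto the pretrained weights. The sketch below shows that with plain Python lists standing in for weight tensors. The names, the toy weights, and the choice of one shared scale for all vectors are illustrative assumptions, not the paper's code.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter difference: fine-tuned weights minus pretrained weights."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

def apply_task_vectors(pretrained, vectors, scale):
    """Add the uniformly scaled sum of task vectors to the pretrained weights."""
    merged = {}
    for name, base in pretrained.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base))]
        merged[name] = [b + scale * s for b, s in zip(base, summed)]
    return merged

# Toy weights (hypothetical), one parameter tensor named "w".
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}  # e.g. fine-tuned on one subgroup's data
finetuned_b = {"w": [1.0, 3.0]}  # e.g. fine-tuned on another subgroup's data

vectors = [task_vector(pretrained, finetuned_a), task_vector(pretrained, finetuned_b)]
merged = apply_task_vectors(pretrained, vectors, scale=0.5)
```

The scale is the knob: at 0 the model is the pretrained one, and larger values move it further along the combined fine-tuning direction. In practice the value would be chosen by measuring accuracy and the group-disparity metrics on held-out data.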

Audit how LLM agents communicate: tone and explanations change decisions even when outcomes don't

0.40

0.60

0.30

0

How agents phrase decisions affects cooperation and task success; monitoring and nudging tone and explanations reduces coordination failures and builds trust in agentic workflows.

Key finding

Respectful tone and clear justification increase proposal acceptance even when resource splits are identical.

Numbers: High-High (5:5) acceptance = 1.0 (Table 3)