Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
2
Why It Matters For Business
ROBBIE helps teams quantify which user groups a generative model may mistreat and compare mitigations quickly; use it to reduce legal, PR, and product harm before deployment.
Summary TLDR
ROBBIE is a benchmark suite and toolkit for measuring social bias and toxicity in generative LLMs. The paper adds two new resources (AdvPromptSet, HolisticBiasR), evaluates 5 model families across 6 prompt-based metrics and 12 demographic axes, measures demographic term frequency in common corpora, and compares three mitigation methods (prompting, self-debiasing, adversarial triggers). Key findings: biases show up differently depending on dataset and metric; self-debiasing helps smaller models (GPT2-XL) but not always large conversational models (BB3-175B); prompting is a strong baseline for larger models. Code and downsized datasets are open-sourced.
Problem Statement
Current bias evaluations use different datasets and metrics and cover only a few demographic axes. That makes cross-model comparisons and choosing mitigations hard. The paper builds a unified, broader test suite and tests mitigations to give practitioners clearer, comparative guidance.
Main Contribution
- ROBBIE: a multi-metric benchmark comparing 6 prompt-based bias/toxicity metrics across 12 demographic axes and 5 LLM families
- AdvPromptSet: a large adversarial prompt set for intersectional testing, released with a downsized version
- HolisticBiasR: Regard templates expanded with 700+ demographic identity terms
- Systematic comparison of three mitigation approaches (prompting, self-debiasing, adversarial triggers) across models, metrics, and axes
- Analysis of demographic-term frequencies in common pretraining corpora, plus open-sourced code and tooling
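The demographic-term frequency analysis can be pictured as a counting pass over a corpus. The following is a minimal sketch, not the paper's tooling: the function name `term_frequencies`, the toy corpus, and the three-term list are all hypothetical, and whole-word, case-insensitive matching is an assumption about how terms would be matched.

```python
import re
from collections import Counter


def term_frequencies(documents, terms):
    """Count whole-word, case-insensitive matches of each term across documents."""
    lowered = [t.lower() for t in terms]
    if not lowered:
        return Counter()
    # Try longer terms first so a multi-word term is preferred over a shorter
    # term that starts at the same position.
    ordered = sorted(lowered, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter({t: 0 for t in lowered})
    for doc in documents:
        for match in pattern.finditer(doc):
            counts[match.group(0).lower()] += 1
    return counts


# Hypothetical toy data, for illustration only.
toy_corpus = ["Older adults and young people met.", "Young adults spoke first."]
toy_terms = ["young", "older", "middle-aged"]
freqs = term_frequencies(toy_corpus, toy_terms)
```

Terms with a zero count stay in the result, which makes under-represented groups visible rather than silently missing.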
Key Findings
- Self-debiasing substantially reduces toxicity in a smaller base model (GPT2-XL).
- Prompting is more effective than self-debiasing on a large conversational model (BB3-175B).
- Measured bias and toxicity depend strongly on the evaluation dataset and metric.
- AdvPromptSet is large and intersectional, which helps expose higher-risk prompts.
- Demographic term frequencies vary across common corpora and do not map simply to model bias.
Results
Toxicity rate (example model)
Mitigation effect — self-debiasing (GPT2-XL)
Mitigation effect — prompting (BB3-175B)
AdvPromptSet scale
Who Should Care
What To Try In 7 Days
- Run the 10k AdvPromptSet sample and HolisticBiasR on your base model to map high-risk subgroups
- Apply self-debiasing at inference for smaller base models and compare toxicity rates
- Try a few instruction-style prompt templates and measure trade-offs between coherence and toxicity
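Mapping high-risk subgroups comes down to computing a toxicity rate per subgroup before and after a mitigation. The sketch below assumes generations are already labeled with a subgroup and that a toxicity classifier is supplied as a callable. `toxicity_rate_by_subgroup` and `placeholder_is_toxic` are hypothetical names, and the keyword check is only a stand-in for a real classifier such as the Perspective API or ToxiGen, not how either one works.

```python
from collections import defaultdict


def toxicity_rate_by_subgroup(records, is_toxic):
    """Return {subgroup: fraction of generations flagged toxic}.

    records  -- iterable of (subgroup, generated_text) pairs
    is_toxic -- callable mapping text to True/False (classifier stand-in)
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for subgroup, text in records:
        totals[subgroup] += 1
        if is_toxic(text):
            flagged[subgroup] += 1
    return {group: flagged[group] / totals[group] for group in totals}


def placeholder_is_toxic(text):
    # Hypothetical stand-in: a real evaluation would call a trained classifier.
    return "hate" in text.lower()


# Hypothetical toy generations, for illustration only.
baseline = [
    ("group_a", "I hate this"),
    ("group_a", "Nice day"),
    ("group_b", "Lovely"),
    ("group_b", "Fine"),
]
baseline_rates = toxicity_rate_by_subgroup(baseline, placeholder_is_toxic)
```

Running the same function on the mitigated generations gives a second dictionary, so the change can be read off per subgroup rather than only in aggregate.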
Reproducibility
Data URLs
- https://github.com/facebookresearch/ResponsibleNLP/tree/main/AdvPromptSet
- HolisticBias dataset (public; referenced without a URL)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- English-only datasets; results may not generalize to other languages
- AdvPromptSet inherits annotation noise from Jigsaw sources and can include multiple labels per prompt
- Automatic toxicity classifiers are imperfect and culturally contextual; human judgments are not fully representative either
- Mitigations tested in isolation; combined or tuned methods may behave differently
When Not To Use
- As the sole safety check for deployed systems; use along with policy, adversarial testing, and human review
- For non-English models or deployment regions without cultural validation
- To claim a model is 'bias-free'—the suite surfaces issues but does not prove absence of harm
Failure Modes
- In some settings, mitigations reduce coherence or hide bias rather than remove it
- Large models may follow prompts but still marginalize subgroups under other prompt styles
- Classifier measurement noise can mask small changes when baseline toxicity is very low
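One way to check whether a small change in toxicity rate is distinguishable from noise is a two-proportion z-test on the flagged counts. This is a generic statistical sketch, not a procedure taken from the paper; the function name is hypothetical, and the test covers sampling variability only, so classifier errors add further uncertainty that it does not model.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(toxic_a, n_a, toxic_b, n_b):
    """Two-sided z-test for a difference between two toxicity rates.

    Returns (z, p_value). Uses the pooled-proportion standard error.
    """
    p_a = toxic_a / n_a
    p_b = toxic_b / n_b
    pooled = (toxic_a + toxic_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both rates are exactly 0 or exactly 1: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 12 vs 8 flagged generations out of 1000 each.
z_small, p_small = two_proportion_z_test(12, 1000, 8, 1000)
# Hypothetical counts: 200 vs 100 flagged generations out of 1000 each.
z_large, p_large = two_proportion_z_test(200, 1000, 100, 1000)
```

With a baseline near 1%, the drop from 12 to 8 flagged generations is not significant at the usual 0.05 level, while the same test easily detects the drop from 20% to 10%.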
Core Entities
Models
- GPT-2
- OPT
- BlenderBot 3 (BB3)
- BLOOM
- LLaMA
Metrics
- BiasScore (percentage of subgroups above background)
- Toxicity % (Perspective API or ToxiGen classifier)
- Negative regard % (Regard classifier)
- Perplexity (text-davinci-002)
- Latency (ms/token)
- Peak GPU memory (GB)
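Read literally, BiasScore is the share of subgroups whose rate exceeds a background rate. The sketch below implements only that literal reading: the function name, the input shape, and the plain `>` comparison are assumptions, and this summary does not say how the background is computed or whether a statistical test is applied.

```python
def bias_score(subgroup_rates, background_rate):
    """Percentage of subgroups whose rate is strictly above the background rate.

    subgroup_rates  -- dict mapping subgroup name to a rate in [0, 1]
    background_rate -- reference rate in [0, 1] (assumed to be given)
    """
    if not subgroup_rates:
        raise ValueError("subgroup_rates must not be empty")
    above = sum(1 for rate in subgroup_rates.values() if rate > background_rate)
    return 100.0 * above / len(subgroup_rates)


# Hypothetical rates, for illustration only.
example_rates = {"a": 0.10, "b": 0.30, "c": 0.20, "d": 0.05}
example_score = bias_score(example_rates, background_rate=0.15)
```

In this toy input, two of the four subgroups sit above the background, so the score is 50.0.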
Datasets
- AdvPromptSet (new)
- HolisticBiasR (new)
- Regard
- RealToxicityPrompts
- BOLD
- ToxiGen v2
- WikiText-103 (performance sampling)
- Jigsaw toxicity datasets (source for AdvPromptSet)
Benchmarks
- ROBBIE (this paper's multi-metric suite)

