ROBBIE: a multi-dataset, multi-metric bias benchmark plus new adversarial prompts and mitigation tests

November 29, 2023, 8 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

2

Authors

David Esiobu, Xiaoqing Tan, Saghar Hosseini, Megan Ung, Yuchen Zhang, Jude Fernandes, Jane Dwivedi-Yu, Eleonora Presani, Adina Williams, Eric Michael Smith

Links

Abstract / PDF

Why It Matters For Business

ROBBIE helps teams quantify which user groups a generative model may mistreat and compare mitigations quickly; use it to reduce legal, PR, and product harm before deployment.

Summary TLDR

ROBBIE is a benchmark suite and toolkit for measuring social bias and toxicity in generative LLMs. The paper adds two new resources (AdvPromptSet, HolisticBiasR), evaluates 5 model families across 6 prompt-based metrics and 12 demographic axes, measures demographic term frequency in common corpora, and compares three mitigation methods (prompting, self-debiasing, adversarial triggers). Key findings: biases show up differently depending on dataset and metric; self-debiasing helps smaller models (GPT2-XL) but not always large conversational models (BB3-175B); prompting is a strong baseline for larger models. Code and downsized datasets are open-sourced.

Problem Statement

Current bias evaluations use different datasets and metrics and cover only a few demographic axes. That makes cross-model comparisons and choosing mitigations hard. The paper builds a unified, broader test suite and tests mitigations to give practitioners clearer, comparative guidance.

Main Contribution

ROBBIE: a multi-metric benchmark comparing 6 prompt-based bias/toxicity metrics across 12 demographic axes and 5 LLM families

AdvPromptSet: a large adversarial prompt set for intersectional testing (released with a downsized 10k sample)

HolisticBiasR: Regard templates expanded with 700+ demographic identity terms

Systematic comparison of three mitigation approaches across models, metrics, and axes

Analysis of demographic-term frequencies in common pretraining corpora and open-sourced code/tooling

Key Findings

Self-debiasing substantially reduces toxicity in a smaller base model (GPT2-XL).

Numbers: 46% mean reduction on evaluated prompting datasets

Prompting is more effective than self-debiasing on a large conversational model (BB3-175B).

Numbers: 28% mean reduction in toxicity for BB3-175B

Bias and toxicity depend strongly on the evaluation dataset and metric.

Numbers: GPT2-XL overall BiasScore ≈ 67% across datasets; toxicity rates range 1.66%–17.7% by dataset

AdvPromptSet is large and intersectional, exposing higher-risk prompts.

Numbers: AdvPromptSet ≈ 199k prompts; downsized 10k version released

Demographic term frequencies vary across common corpora and do not map simply to model bias.

Numbers: 'female' weighted mean 3.51% vs 'male' 2.72% across corpora
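To make the term-frequency finding concrete, here is a minimal sketch of how a per-corpus document share and a weighted mean across corpora might be computed. The whole-word matching rule, the weighting by document count, the function names, and the toy corpora are all assumptions for illustration; the summary does not say how the paper weights corpora.

```python
import re

def term_document_share(documents, term):
    """Fraction of documents that mention `term` as a whole word (case-insensitive)."""
    if not documents:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = sum(1 for doc in documents if pattern.search(doc))
    return hits / len(documents)

def weighted_mean_share(corpora, term):
    """Mean of per-corpus shares, weighted by document count (an assumed weighting)."""
    total_docs = sum(len(docs) for docs in corpora.values())
    if total_docs == 0:
        return 0.0
    weighted = sum(term_document_share(docs, term) * len(docs)
                   for docs in corpora.values())
    return weighted / total_docs

# Hypothetical toy corpora, not data from the paper.
toy_corpora = {
    "corpus_a": ["The female lead won.", "A male nurse spoke.", "No terms here."],
    "corpus_b": ["Female and male athletes competed."],
}

female_share = weighted_mean_share(toy_corpora, "female")
male_share = weighted_mean_share(toy_corpora, "male")
```

The word-boundary pattern matters here: without it, every mention of "female" would also count as a mention of "male".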

Results

Toxicity rate (example model)

Value: GPT2-XL: RTP 1.66%; BOLD 0.35%; ToxiGen v2 11.78%; AdvPromptSet 17.7%; Regard neg. regard 25.1%

Mitigation effect — self-debiasing (GPT2-XL)

Value: RTP toxicity reduced from baseline to 0.59%, with similar reductions on the other datasets (46% mean reduction reported)

Baseline: mean toxicity varies by dataset (e.g. RTP 1.66%)
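The relationship between a per-dataset drop and a mean reduction can be sketched as below. This assumes "reduction" means relative reduction against the baseline rate and that the mean is a plain average of per-dataset reductions; the summary does not confirm either choice, and the helper names are hypothetical. Only the RTP pair (1.66% to 0.59%) comes from the figures above.

```python
def relative_reduction(baseline, mitigated):
    """Relative drop from `baseline` to `mitigated`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - mitigated) / baseline

def mean_relative_reduction(pairs):
    """Unweighted mean of relative reductions over (baseline, mitigated) pairs."""
    if not pairs:
        raise ValueError("need at least one (baseline, mitigated) pair")
    return sum(relative_reduction(b, m) for b, m in pairs) / len(pairs)

# RTP figures for GPT2-XL with self-debiasing, taken from this summary.
rtp_reduction = relative_reduction(1.66, 0.59)
print(f"RTP relative reduction: {rtp_reduction:.1%}")  # prints "RTP relative reduction: 64.5%"
```

On RTP alone the relative drop is larger than the 46% mean, which is consistent with other datasets showing smaller reductions.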

Mitigation effect — prompting (BB3-175B)

Value: Prompting yields ~28% mean reduction in toxicity/negative regard for BB3-175B

Baseline: BB3-175B baseline toxicity varies by dataset (e.g. AdvPromptSet 29.0%)

AdvPromptSet scale

Value: 199,403 prompts (full); downsized 10k sample provided

Who Should Care

What To Try In 7 Days

Run the 10k AdvPromptSet sample and HolisticBiasR on your base model to map high-risk subgroups

Apply self-debiasing at inference for smaller base models and compare toxicity rates

Try a few instruction-style prompt templates and measure trade-offs in coherence vs toxicity
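The steps above can be sketched as a small harness that reports toxicity per subgroup, with and without an instruction-style prefix. This is not the ROBBIE toolkit's API: `subgroup_toxicity_rates`, the `generate` and `is_toxic` callables, and the prefix text are hypothetical placeholders that you would replace with your own model call and toxicity classifier.

```python
from collections import defaultdict

def subgroup_toxicity_rates(prompts, generate, is_toxic, prefix=""):
    """Return {subgroup: fraction of toxic continuations}.

    prompts  : iterable of (subgroup_label, prompt_text) pairs
    generate : callable mapping a prompt string to a continuation (your model)
    is_toxic : callable mapping a continuation to True/False (your classifier)
    prefix   : optional instruction text prepended to every prompt
    """
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [toxic, total]
    for subgroup, text in prompts:
        continuation = generate(prefix + text)
        counts[subgroup][0] += int(is_toxic(continuation))
        counts[subgroup][1] += 1
    return {group: toxic / total for group, (toxic, total) in counts.items()}

# Toy stand-ins so the sketch runs end to end; they are not real models.
def toy_generate(prompt):
    if prompt.startswith("Be respectful."):
        return "a polite reply"
    return "reply to: " + prompt

def toy_is_toxic(text):
    return "insult" in text

toy_prompts = [("group_a", "an insult"), ("group_a", "hello"), ("group_b", "hi")]

baseline_rates = subgroup_toxicity_rates(toy_prompts, toy_generate, toy_is_toxic)
prompted_rates = subgroup_toxicity_rates(
    toy_prompts, toy_generate, toy_is_toxic, prefix="Be respectful. "
)
```

Comparing `baseline_rates` and `prompted_rates` subgroup by subgroup shows where a mitigation helps and where it does not, which a single aggregate rate would hide. Coherence would need a separate measure, such as perplexity.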

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • English-only datasets; results may not generalize to other languages
  • AdvPromptSet inherits annotation noise from Jigsaw sources and can include multiple labels per prompt
  • Automatic toxicity classifiers are imperfect and culturally contextual; human annotations are not fully representative either
  • Mitigations tested in isolation; combined or tuned methods may behave differently

When Not To Use

  • As the sole safety check for deployed systems; use along with policy, adversarial testing, and human review
  • For non-English models or deployment regions without cultural validation
  • To claim a model is 'bias-free'—the suite surfaces issues but does not prove absence of harm

Failure Modes

  • Mitigations trade off coherence or hide rather than remove bias in some settings
  • Large models may follow prompts but still marginalize subgroups under other prompt styles
  • Classifier measurement noise can mask small changes when baseline toxicity is very low
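The last failure mode can be illustrated with a rough calculation. The sketch below covers only binomial sampling noise, not classifier error, so it understates the real uncertainty. The 0.35% rate is the GPT2-XL BOLD figure from this summary; the sample size of 1,000 generations is a hypothetical choice.

```python
import math

def proportion_standard_error(rate, n):
    """Binomial standard error of an observed proportion from n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return math.sqrt(rate * (1.0 - rate) / n)

baseline_rate = 0.0035   # 0.35% toxicity (GPT2-XL on BOLD, per this summary)
n_samples = 1000         # hypothetical evaluation size

se = proportion_standard_error(baseline_rate, n_samples)
half_width_95 = 1.96 * se  # normal-approximation 95% interval half-width

# True when the sampling interval is wider than the rate being measured.
noise_dominates = half_width_95 > baseline_rate
```

Under these assumptions the interval half-width exceeds the baseline rate itself, so even halving toxicity would be hard to distinguish from noise without many more samples.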

Core Entities

Models

  • GPT-2
  • OPT
  • BlenderBot 3 (BB3)
  • BLOOM
  • LLaMA

Metrics

  • BiasScore (percentage of subgroups above background)
  • Toxicity % (Perspective API or ToxiGen classifier)
  • Negative regard % (Regard classifier)
  • Perplexity (text-davinci-002)
  • Latency (ms/token)
  • Peak GPU memory (GB)
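As a simplified reading of the BiasScore entry above, the sketch below counts the share of subgroups whose rate exceeds a background rate. It compares point estimates only; the summary does not spell out how the paper decides that a subgroup is "above background" or how the background is defined, so the strict comparison, the function name, and the toy rates are assumptions.

```python
def bias_score(subgroup_rates, background_rate):
    """Percentage of subgroups whose rate is strictly above the background rate.

    subgroup_rates  : dict mapping subgroup label -> toxicity or negative-regard rate
    background_rate : reference rate, e.g. the overall rate across all prompts
                      (an assumed definition)
    """
    if not subgroup_rates:
        raise ValueError("need at least one subgroup")
    above = sum(1 for rate in subgroup_rates.values() if rate > background_rate)
    return 100.0 * above / len(subgroup_rates)

# Hypothetical rates, not results from the paper.
toy_rates = {"group_a": 0.10, "group_b": 0.02, "group_c": 0.07}
toy_score = bias_score(toy_rates, background_rate=0.05)
```

Here two of three subgroups sit above the background, so `toy_score` is about 66.7. A production version would likely add a significance test so that small-sample subgroups are not flagged on noise alone.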

Datasets

  • AdvPromptSet (new)
  • HolisticBiasR (new)
  • Regard
  • RealToxicityPrompts
  • BOLD
  • ToxiGen v2
  • WikiText-103 (performance sampling)
  • Jigsaw toxicity datasets (source for AdvPromptSet)

Benchmarks

  • ROBBIE (this paper's multi-metric suite)