BiasLab: a multilingual, dual-framing toolkit for robust output-level bias audits

January 11, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

William Guey, Wei Zhang, Pei-Luen Patrick Rau, Pierrick Bougault, Vitor D. de Moura, Bertan Ucar, Jose O. Gomes

Why It Matters For Business

BiasLab gives teams a repeatable, multilingual way to compare model outputs for directional bias, helping pick safer models and flag risky behaviors before deployment.

Summary TLDR

BiasLab is an open-source, model-agnostic toolbox for measuring output-level (extrinsic) bias in large language models. It uses strictly mirrored prompt pairs (affirmative vs reverse framing), randomized multilingual wrapper prompts, a forced-choice Likert response format, and an LLM-based judge to normalize outputs. Scores are polarity-aligned and aggregated into mean bias, neutrality rate, and effect-size metrics. The framework emphasizes robustness to prompt wording and cross-lingual comparison, but it measures only output behavior, relies on an LLM judge, and uses a constrained choice format that limits realism.

Problem Statement

Existing bias audits are sensitive to prompt wording, often English-only, and use heterogeneous output formats that block fair cross-model comparison. Practitioners lack a standardized, language-inclusive method to measure directional output bias reliably across models and prompt variants.

Main Contribution

A dual-framing probe design that creates strictly mirrored affirmative and reverse prompts by deterministic target substitution to isolate directional preference.

A multilingual probe pipeline with randomized prefix/suffix wrappers to test robustness to prompt wording across languages.

A forced-choice Likert response format plus an LLM-based judge that normalizes diverse model outputs into agreed categories, enabling quantitative aggregation.

A polarity-aligned scoring and reporting suite that outputs mean bias score, neutrality rate, Cohen's d, t-test, and visualizations for per-language and cross-language comparison.

Open-source release with code, live demo, and reproducible artifacts for institutional auditing.
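
The first two contributions can be sketched in a few lines. This is a minimal illustration, not BiasLab's actual code: it assumes "target substitution" means swapping the two targets inside one shared template, and the template, targets, wrapper strings, and the names `ProbePair`, `build_pair`, and `wrap` are all hypothetical. Only English wrappers are shown; a multilingual run would keep one wrapper pool per language.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbePair:
    affirmative: str
    reverse: str


def build_pair(template: str, target_a: str, target_b: str) -> ProbePair:
    """Create mirrored prompts by swapping the two targets in one template."""
    affirmative = template.format(x=target_a, y=target_b)
    reverse = template.format(x=target_b, y=target_a)
    return ProbePair(affirmative, reverse)


def wrap(prompt: str, prefixes: list[str], suffixes: list[str],
         rng: random.Random) -> str:
    """Surround a probe with a randomly chosen prefix and suffix."""
    return f"{rng.choice(prefixes)} {prompt} {rng.choice(suffixes)}"


# Hypothetical template, targets, and wrappers, for illustration only.
TEMPLATE = "{x} makes better long-term decisions than {y}."
PREFIXES = ["Consider this statement:", "Please evaluate the claim:"]
SUFFIXES = [
    "Answer with one option from Strongly agree to Strongly disagree.",
    "Pick exactly one option on the agree/disagree scale.",
]

rng = random.Random(0)  # fixed seed so the sampled wrappers are repeatable
pair = build_pair(TEMPLATE, "Group A", "Group B")

# Three wrapped variants stand in for the N robustness iterations.
variants = [
    (wrap(pair.affirmative, PREFIXES, SUFFIXES, rng),
     wrap(pair.reverse, PREFIXES, SUFFIXES, rng))
    for _ in range(3)
]
```

Because both framings come from the same template, any difference in the model's answers can be attributed to the target order rather than to wording.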

Key Findings

Dual-framing with exact target substitution isolates directional bias from wording differences.

Randomized multilingual wrappers reduce sensitivity to single-prompt artifacts by sampling multiple prefix/suffix variants.

Numbers: Uses N robustness iterations per language

Forced-choice Likert plus an LLM-based judge maps heterogeneous outputs into a unified ordinal score (-2..+2).

Numbers: Ordinal mapping: Strongly agree = +2 ... Strongly disagree = -2

BiasLab reports neutrality rate to distinguish balanced outputs from abstention or refusal.

Numbers: Neutrality rate = proportion of zero outcomes

Framework is open-source and reproducible with artifacts and visualizations available online.

Limitations include: extrinsic-only scope, forced-choice realism limits, translation drift, LLM-judge measurement risk, and endpoint versioning instability.
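
The scoring steps in these findings can be illustrated with a short sketch. Two points here are assumptions rather than details from the paper: the three middle labels of the scale (only the +2 and -2 endpoints are given above), and the idea that polarity alignment means flipping the sign of scores from reverse-framed prompts. The function names are hypothetical.

```python
from statistics import mean

# Endpoints (+2, -2) come from the text; the three middle labels are assumed.
LIKERT = {
    "strongly agree": 2,
    "agree": 1,
    "neutral": 0,
    "disagree": -1,
    "strongly disagree": -2,
}


def aligned_score(label: str, framing: str) -> int:
    """Map a judge label to -2..+2, flipping the sign for reverse framing."""
    score = LIKERT[label.strip().lower()]
    return score if framing == "affirmative" else -score


def summarize(records: list[tuple[str, str]]) -> dict[str, float]:
    """Return mean bias and neutrality rate for (label, framing) records."""
    scores = [aligned_score(label, framing) for label, framing in records]
    return {
        "mean_bias": mean(scores),
        # Neutrality rate = proportion of zero outcomes.
        "neutrality_rate": sum(s == 0 for s in scores) / len(scores),
    }


# Made-up judge labels, for illustration only.
records = [
    ("Strongly agree", "affirmative"),
    ("Disagree", "reverse"),
    ("Neutral", "affirmative"),
    ("Neutral", "reverse"),
]
print(summarize(records))  # {'mean_bias': 0.75, 'neutrality_rate': 0.5}
```

In this toy input, agreeing with the affirmative prompt and disagreeing with the reverse prompt both point the same way, so after alignment they add up instead of cancelling.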

What To Try In 7 Days

Run BiasLab on 3 business-critical prompt pairs (English + one key customer language) to compare vendor models.

Check neutrality rates to spot refusal vs genuine balance for each model.

Inspect judge-normalized labels and 10 raw outputs per model to validate judge behavior and translation quality.
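
For the manual inspection step, one simple option is to draw a fixed-seed random sample of rows per model. The sketch below assumes results are held as a list of dictionaries with hypothetical keys `model`, `raw_output`, and `judge_label`; BiasLab's real output format may differ.

```python
import random
from collections import defaultdict


def review_sample(rows: list[dict], per_model: int = 10,
                  seed: int = 0) -> dict[str, list[dict]]:
    """Pick up to `per_model` rows for each model for manual inspection."""
    by_model: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the review set repeatable
    return {
        model: rng.sample(items, min(per_model, len(items)))
        for model, items in by_model.items()
    }


# Placeholder rows, for illustration only.
rows = [
    {"model": name, "raw_output": f"output {i}", "judge_label": "Neutral"}
    for name in ("model-a", "model-b")
    for i in range(25)
]
sample = review_sample(rows)
```

Reading the sampled raw outputs next to their judge labels is a quick way to see whether a zero score reflects a balanced answer or a refusal.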

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Measures extrinsic (output) bias only; does not diagnose internal model causes.
  • Forced-choice Likert improves comparability but misses subtle harms in free text.
  • Automated multilingual probe generation can introduce translation drift or semantic asymmetry.
  • LLM-based judge normalization may inject labeling bias and depends on judge choice.
  • Remote model endpoints can change over time; versioning metadata may be incomplete.
  • Results generalize only to tested topic-target pairs; broader conclusions need larger topic libraries.

When Not To Use

  • When you need to trace bias causes to training data or embeddings (intrinsic analysis required).
  • When assessing subtle open-ended harms like stereotyping in long-form outputs.
  • When you cannot guarantee model endpoint stability or timestamped provenance.

Failure Modes

  • Judge mislabels hedged or culturally idiomatic responses, skewing bias estimates.
  • Probe translation mismatches create artificial asymmetries across languages.
  • High neutrality rates may reflect safety-triggered refusals rather than genuine neutrality.
  • Provider updates change model behavior after evaluation, breaking comparability.

Core Entities

Metrics

  • mean bias score
  • neutrality rate
  • Cohen's d
  • one-sample t-test (t, p)
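
As a rough guide to how the last two metrics relate to the mean bias score, the sketch below computes a one-sample Cohen's d and t statistic against a reference value of zero using only the standard library. Testing against zero and using the one-sample form of Cohen's d are assumptions; the paper's exact definitions are not given here. The function name and sample scores are hypothetical. The p-value is omitted because it needs the t distribution, which a library call such as `scipy.stats.ttest_1samp` can supply.

```python
import math
from statistics import mean, stdev


def one_sample_stats(scores: list[float], mu0: float = 0.0) -> dict[str, float]:
    """Mean, one-sample Cohen's d, and t statistic relative to `mu0`."""
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all scores are identical; d and t are undefined")
    return {
        "mean_bias": m,
        "cohens_d": (m - mu0) / sd,
        "t": (m - mu0) / (sd / math.sqrt(n)),
        "df": n - 1,
    }


# Made-up polarity-aligned scores in -2..+2, for illustration only.
scores = [2, 1, 0, 0, 1, -1, 2, 1]
stats = one_sample_stats(scores)
```

In this formulation t equals d multiplied by the square root of the sample size, so a small but consistent bias can be statistically significant with enough probes while still having a modest effect size.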