Overview
The dataset is well-documented and experiments are comprehensive; however, it omits four sensitive categories and some evaluation code is not released, limiting immediate plug-and-play use.
Citations: 2
Evidence Strength: 0.85
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
Japanese LLMs become more accurate as they scale, but they also amplify harmful stereotypes. Use JBBQ to measure bias, and apply instruction tuning, bias-aware prompts, or chain-of-thought (CoT) prompting before deploying in user-facing systems.
Summary TLDR
The authors build JBBQ, a Japanese multiple-choice QA dataset adapted from the English BBQ to measure social stereotyping in Japanese LLMs. JBBQ covers five culturally adjusted categories (age, disability, gender identity, physical appearance, sexual orientation), with 245 templates and 50,856 question pairs (plus a 912-sample Lite set). Baseline tests on open Japanese models and GPT-4o variants show larger models have higher QA accuracy but also stronger bias signals. Instruction tuning, a prompt that warns about bias (paraP), and chain-of-thought prompting reduce biased outputs and increase selection of 'unknown' when context is insufficient. The dataset and evaluation scripts are intended
Problem Statement
Most bias benchmarks target English and reflect US cultural contexts. Japanese LLMs lack a culturally adapted QA benchmark to measure social stereotyping. Without a localized test, models may appear safe while still reflecting harmful Japanese stereotypes.
Main Contribution
JBBQ: a Japanese multiple-choice QA benchmark for social bias, adapted from BBQ and culturally adjusted
Dataset release: 245 templates, 50,856 question pairs; JBBQ-Lite: 912 pairs for quick evaluation
Key Findings
Larger model size raises both accuracy and bias.
Instruction tuning helps models pick 'unknown' for ambiguous questions.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 82.7 | 48.6 (13B inst. examples) | +34.1 | SWL3-70B-INST on full JBBQ, 3-shot basicP | SWL3-70B-INST Acc Avg 82.7 (Table 3) | Table 3 |
| Diff-bias (Ambiguous) | +23.1 | +0 to +7 (smaller models) | substantially higher | SWL3-70B-INST, 3-shot basicP | High ambiguous diff-bias indicates biased incorrect answers (Table 3) | Table 3 |
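
The two metrics in the table can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `Prediction` record, its label names, and the diff-bias formula (share of biased answers minus share of counter-biased answers among ambiguous items, in percentage points) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    # Hypothetical record layout, not the JBBQ file format.
    context: str    # "ambiguous" or "disambiguated"
    predicted: str  # "biased", "counter_biased", or "unknown"
    gold: str       # same label set as `predicted`


def accuracy(preds: List[Prediction]) -> float:
    """Percentage of items where the predicted label matches the gold label."""
    if not preds:
        return 0.0
    return 100.0 * sum(p.predicted == p.gold for p in preds) / len(preds)


def diff_bias_ambiguous(preds: List[Prediction]) -> float:
    """Assumed definition: (biased - counter_biased) / ambiguous items, in points.

    A positive value means wrong answers lean toward the stereotype.
    """
    amb = [p for p in preds if p.context == "ambiguous"]
    if not amb:
        return 0.0
    biased = sum(p.predicted == "biased" for p in amb)
    counter = sum(p.predicted == "counter_biased" for p in amb)
    return 100.0 * (biased - counter) / len(amb)


# Toy data: 10 ambiguous items whose correct answer is "unknown";
# the model answers 4 biased, 1 counter-biased, 5 unknown.
demo = (
    [Prediction("ambiguous", "biased", "unknown")] * 4
    + [Prediction("ambiguous", "counter_biased", "unknown")] * 1
    + [Prediction("ambiguous", "unknown", "unknown")] * 5
)

print(accuracy(demo), diff_bias_ambiguous(demo))
```

The Delta column for accuracy is a plain difference between the two reported values: 82.7 - 48.6 = 34.1.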
What To Try In 7 Days
Run JBBQ-Lite on candidate Japanese LLMs to measure baseline bias scores
Compare the basicP and paraP prompts, and enable CoT to see whether having the model state its reasoning reduces biased answers
Prefer instruction-tuned checkpoints when the task requires cautious answers on ambiguous inputs
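The prompt comparison above could be set up along these lines. This is a hedged sketch: the instruction wording, the item dictionary keys, and the `model` callable (a function that takes a prompt and returns the index of the chosen option) are all hypothetical and do not reproduce the actual basicP or paraP prompts.

```python
from typing import Callable, Dict, List

# Hypothetical wording; the real basicP and paraP texts are not given here.
BASIC_INSTRUCTION = "Answer the question by choosing one option."
BIAS_WARNING = (
    "Do not rely on social stereotypes. "
    "If the context is insufficient, choose the 'unknown' option."
)


def build_prompt(context: str, question: str, options: List[str], variant: str) -> str:
    """Build a multiple-choice prompt for the given variant ('basicP' or 'paraP')."""
    if variant not in ("basicP", "paraP"):
        raise ValueError(f"unknown prompt variant: {variant}")
    header = BASIC_INSTRUCTION
    if variant == "paraP":
        header = header + " " + BIAS_WARNING
    lines = [header, "", f"Context: {context}", f"Question: {question}"]
    lines += [f"{i}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def unknown_rate(model: Callable[[str], int], items: List[Dict], variant: str) -> float:
    """Fraction of items on which the model selects the 'unknown' option."""
    if not items:
        return 0.0
    hits = 0
    for item in items:
        prompt = build_prompt(item["context"], item["question"], item["options"], variant)
        if model(prompt) == item["unknown_index"]:
            hits += 1
    return hits / len(items)


# Placeholder item and stand-in model, for illustration only.
items = [
    {
        "context": "Two people were waiting at the clinic.",
        "question": "Who forgot the appointment?",
        "options": ["Person A", "Person B", "Unknown"],
        "unknown_index": 2,
    }
]


def toy_model(prompt: str) -> int:
    # Picks 'unknown' only when the prompt carries the bias warning.
    return 2 if "stereotypes" in prompt else 0


for variant in ("basicP", "paraP"):
    print(variant, unknown_rate(toy_model, items, variant))
```

In practice `toy_model` would be replaced by a wrapper around the candidate LLM, and `items` by the ambiguous-context subset of JBBQ-Lite.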
Risks & Boundaries
Limitations
JBBQ covers only five categories (age, disability, gender identity, physical appearance, sexual orientation); nationality, race, religion, and socioeconomic status were excluded
No intersectional templates (e.g., gender × race) were created
When Not To Use
As a comprehensive audit for race or nationality biases (those categories were excluded)
As training data to improve model behavior (authors advise against using JBBQ to train generative models)
Failure Modes
Models produce plausible but incorrect reasoning steps under CoT (unfaithful explanations)
Prompt conflicts (e.g., paraP vs bias-detection objective) can reduce task performance

