JBBQ — a Japanese multiple-choice QA benchmark to measure and reduce social bias in Japanese LLMs

Overview

Decision Snapshot: Ready For Pilot

The dataset is well documented and the experiments are comprehensive. However, it omits four sensitive categories (nationality, race, religion, socioeconomic status), and some evaluation code is not released, which limits immediate plug-and-play use.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 30%

Authors

Hitomi Yanaka, Namgi Han, Ryoma Kumon, Jie Lu, Masashi Takeshita, Ryo Sekizawa, Taisei Kato, Hiromi Arai

Links

Abstract / PDF / Data

Why It Matters For Business

Japanese LLMs become more accurate as they scale, but the larger models also show stronger bias signals. Use JBBQ to measure bias, and apply instruction tuning, bias-aware prompts, or CoT before deploying in user-facing systems.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Founder

Summary TLDR

The authors build JBBQ, a Japanese multiple-choice QA dataset adapted from the English BBQ to measure social stereotyping in Japanese LLMs. JBBQ covers five culturally adjusted categories (age, disability, gender identity, physical appearance, sexual orientation), with 245 templates and 50,856 question pairs (plus a 912-sample Lite set). Baseline tests on open Japanese models and GPT-4o variants show larger models have higher QA accuracy but also stronger bias signals. Instruction tuning, a prompt that warns about bias (paraP), and chain-of-thought prompting reduce biased outputs and increase selection of 'unknown' when context is insufficient. The dataset and evaluation scripts are intended

Problem Statement

Most bias benchmarks target English and reflect US cultural contexts. Japanese LLMs lack a culturally adapted QA benchmark to measure social stereotyping. Without a localized test, models may appear safe while still reflecting harmful Japanese stereotypes.

Main Contribution

JBBQ: a Japanese multiple-choice QA benchmark for social bias, adapted from BBQ and culturally adjusted

Dataset release: 245 templates, 50,856 question pairs; JBBQ-Lite: 912 pairs for quick evaluation

Key Findings

Larger model size raises both accuracy and bias.

Numbers: Acc Avg 48.6 (13B INST) → 82.7 (SWL3-70B-INST); Diff-bias Amb +23.1 for SWL3-70B-INST

Practical Use: Expect better QA performance from bigger Japanese LLMs, but also a stronger tendency to produce biased answers; test models on JBBQ before deployment.

Evidence Ref: Table 3, Figure 2; Table 4

Instruction tuning helps models pick 'unknown' for ambiguous questions.

Numbers: Instruction-tuned models show higher Acc Amb and reduced OoC; paraP increased Acc Amb from 72.2 → 95.5 for SWL3-70B-INST

Practical Use: Apply instruction tuning or use instruction-tuned checkpoints to reduce confident stereotyping on unclear inputs.

Evidence Ref: Figure 1; Table 6
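The OoC ratio mentioned above can be illustrated with a minimal sketch. It assumes that OoC means a model output that does not map to any of the offered answer options, and that each question has three numbered options as in BBQ. The summary does not specify the parsing rules; `parse_choice` and `ooc_ratio` are hypothetical helper names, not the authors' evaluation code.

```python
import re
from typing import List, Optional


def parse_choice(raw_output: str, num_choices: int = 3) -> Optional[int]:
    """Return the chosen option index, or None if the output is out of choice.

    Assumption: options are numbered 0..num_choices-1 and the model is
    expected to answer with one of those numbers.
    """
    match = re.search(r"\d+", raw_output)
    if match is None:
        return None  # no number at all, e.g. a refusal or free text
    index = int(match.group())
    return index if 0 <= index < num_choices else None


def ooc_ratio(outputs: List[str], num_choices: int = 3) -> float:
    """Percentage of outputs that do not select any valid option."""
    if not outputs:
        return 0.0
    misses = sum(parse_choice(o, num_choices) is None for o in outputs)
    return 100.0 * misses / len(outputs)


# Toy outputs, not real model responses.
sample_outputs = ["0", "Answer: 2", "I cannot say", "7"]
sample_ooc = ooc_ratio(sample_outputs)
```

Tracking this ratio next to accuracy helps separate a model that answers cautiously from one that simply fails to follow the answer format.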

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	82.7	48.6 (13B inst. examples)	+34.1	SWL3-70B-INST on full JBBQ, 3-shot basicP	SWL3-70B-INST Acc Avg 82.7 (Table 3)	Table 3
Diff-bias (Ambiguous)	+23.1	+0 to +7 (smaller models)	substantially higher	SWL3-70B-INST, 3-shot basicP	High ambiguous diff-bias indicates biased incorrect answers (Table 3)	Table 3
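The metrics in the table can be sketched as follows. The summary does not define Diff-bias; the sketch assumes a KoBBQ-style definition for ambiguous contexts, namely the share of biased answers minus the share of counter-biased answers. The `Prediction` record and its label vocabulary are hypothetical. The last lines recompute the +34.1 accuracy delta from the two table values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    context: str    # "ambiguous" or "disambiguated" (hypothetical labels)
    gold: str       # "biased", "counter_biased", or "unknown"
    predicted: str  # same vocabulary as gold


def accuracy(preds: List[Prediction], context: str) -> float:
    """Percentage of correct answers within one context type."""
    rows = [p for p in preds if p.context == context]
    if not rows:
        return 0.0
    return 100.0 * sum(p.predicted == p.gold for p in rows) / len(rows)


def diff_bias_ambiguous(preds: List[Prediction]) -> float:
    """Assumed definition: (biased - counter-biased answers) / ambiguous items.

    Positive values mean the model leans toward the stereotyped answer
    when the context does not support any specific answer.
    """
    rows = [p for p in preds if p.context == "ambiguous"]
    if not rows:
        return 0.0
    biased = sum(p.predicted == "biased" for p in rows)
    counter = sum(p.predicted == "counter_biased" for p in rows)
    return 100.0 * (biased - counter) / len(rows)


# Toy data, not drawn from JBBQ: the gold answer is "unknown" in every
# ambiguous item.
toy = [
    Prediction("ambiguous", "unknown", "biased"),
    Prediction("ambiguous", "unknown", "biased"),
    Prediction("ambiguous", "unknown", "unknown"),
    Prediction("ambiguous", "unknown", "counter_biased"),
]
toy_acc_amb = accuracy(toy, "ambiguous")
toy_diff_bias = diff_bias_ambiguous(toy)

# Delta from the table: SWL3-70B-INST (82.7) against the 13B INST baseline (48.6).
acc_delta = round(82.7 - 48.6, 1)
```

Under this assumed definition, a model that always picks "unknown" in ambiguous contexts scores 0, so the +23.1 reported for SWL3-70B-INST would indicate a clear lean toward stereotyped answers.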

What To Try In 7 Days

Run JBBQ-Lite on candidate Japanese LLMs to measure baseline bias scores

Compare the basic prompt with paraP, and enable CoT to check whether having the model output its reasoning reduces biased answers

Prefer instruction-tuned checkpoints when the task requires cautious answers on ambiguous inputs
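The prompt comparison in the steps above could be set up along these lines. This is a rough sketch under stated assumptions: the instruction strings are placeholders (the actual basicP and paraP wording is in the paper and is not reproduced here), the item fields are a hypothetical schema rather than the JBBQ file format, and `model` stands for any callable that maps a prompt string to an answer string.

```python
from typing import Callable, Dict, List

# Placeholder instructions only; substitute the real prompt texts.
PROMPTS: Dict[str, str] = {
    "basicP": "<basic task instruction>",
    "paraP": "<basic task instruction plus a warning about social bias>",
}


def build_prompt(instruction: str, context: str, question: str,
                 choices: List[str]) -> str:
    """Assemble one multiple-choice prompt with numbered options."""
    lines = [instruction, f"Context: {context}", f"Question: {question}"]
    lines += [f"{i}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def unknown_rate(model: Callable[[str], str], items: List[dict],
                 instruction: str) -> float:
    """Percentage of items where the model picks the 'unknown' option."""
    if not items:
        return 0.0
    hits = 0
    for item in items:
        prompt = build_prompt(instruction, item["context"],
                              item["question"], item["choices"])
        answer = model(prompt).strip()
        hits += int(answer == str(item["unknown_index"]))
    return 100.0 * hits / len(items)


def compare_prompts(model: Callable[[str], str],
                    items: List[dict]) -> Dict[str, float]:
    """Run the same ambiguous items under every prompt variant."""
    return {name: unknown_rate(model, items, text)
            for name, text in PROMPTS.items()}


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers with option 2."""
    return "2"


# Toy ambiguous items in the hypothetical schema.
toy_items = [
    {"context": "c1", "question": "q1",
     "choices": ["A", "B", "Unknown"], "unknown_index": 2},
    {"context": "c2", "question": "q2",
     "choices": ["Unknown", "A", "B"], "unknown_index": 0},
]
toy_report = compare_prompts(stub_model, toy_items)
```

Replacing `stub_model` with a function that calls the candidate model gives a per-prompt view of how often it declines to guess on ambiguous inputs.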

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/ynklab/JBBQ_data

Risks & Boundaries

Limitations

JBBQ covers only five categories (age, disability, gender identity, physical appearance, sexual orientation); nationality, race, religion, and socioeconomic status were excluded

No intersectional templates (e.g., gender × race) were created

When Not To Use

As a comprehensive audit for race or nationality biases (those categories were excluded)

As training data to improve model behavior (authors advise against using JBBQ to train generative models)

Failure Modes

Models produce plausible but incorrect reasoning steps under CoT (unfaithful explanations)

Prompt conflicts (e.g., paraP vs bias-detection objective) can reduce task performance

Core Entities

Models

LLMJP, LLMJP-INST, SWL2-13B, SWL2-13B-INST, SWL2-70B, SWL2-70B-INST, SWL3-70B, SWL3-70B-INST, GPT4O, GPT4O-MINI

Metrics

Accuracy, Diff-bias, Bias score (BS), Acc. Diff., OoC ratio

Datasets

JBBQ, JBBQ-Lite, BBQ, CBBQ, KoBBQ

Benchmarks

JBBQ, BBQ