JBBQ — a Japanese multiple-choice QA benchmark to measure and reduce social bias in Japanese LLMs

June 4, 2024 (7 min read)

Overview

Production Readiness: 0.6

Novelty Score: 0.3

Cost Impact Score: 0.4

Citation Count: 2

Authors

Hitomi Yanaka, Namgi Han, Ryoma Kumon, Jie Lu, Masashi Takeshita, Ryo Sekizawa, Taisei Kato, Hiromi Arai

Why It Matters For Business

Japanese LLMs tend to become more accurate as they scale, but the larger models also show stronger bias signals. Use JBBQ to measure bias, and consider instruction tuning, a bias-warning prompt (paraP), or chain-of-thought (CoT) prompting before deploying in user-facing systems.

Summary TLDR

The authors build JBBQ, a Japanese multiple-choice QA dataset adapted from the English BBQ to measure social stereotyping in Japanese LLMs. JBBQ covers five culturally adjusted categories (age, disability, gender identity, physical appearance, sexual orientation), with 245 templates and 50,856 question pairs (plus a 912-sample Lite set). Baseline tests on open Japanese models and GPT-4o variants show larger models have higher QA accuracy but also stronger bias signals. Instruction tuning, a prompt that warns about bias (paraP), and chain-of-thought prompting reduce biased outputs and increase selection of 'unknown' when context is insufficient. The dataset and evaluation scripts are intended

Problem Statement

Most bias benchmarks target English and reflect US cultural contexts. Japanese LLMs lack a culturally adapted QA benchmark to measure social stereotyping. Without a localized test, models may appear safe while still reflecting harmful Japanese stereotypes.

Main Contribution

JBBQ: a Japanese multiple-choice QA benchmark for social bias, adapted from BBQ and culturally adjusted

Dataset release: 245 templates, 50,856 question pairs; JBBQ-Lite: 912 pairs for quick evaluation

Baseline analysis across open Japanese LLMs and GPT-4o variants showing trade-offs between accuracy and bias

Empirical tests showing instruction tuning, a bias-warning prompt (paraP), and Chain-of-Thought (CoT) reduce biased outputs and increase 'unknown' answers

Key Findings

Larger model size raises both accuracy and bias.

Numbers: Acc Avg 48.6 (13B INST) → 82.7 (SWL3-70B-INST); Diff-bias Amb +23.1 for SWL3-70B-INST

Instruction tuning helps models pick 'unknown' for ambiguous questions.

Numbers: Instruction-tuned models show higher Acc Amb and lower OoC; paraP raised Acc Amb from 72.2 to 95.5 for SWL3-70B-INST

Prompting with a bias warning (paraP) raises correct 'unknown' selections but can hurt disambiguated accuracy.

Numbers: SWL3-70B-INST Acc Amb 72.2 → 95.5 with paraP; Acc Dis 93.2 → 82.7
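The trade-off in these figures can be made concrete with a little arithmetic. The sketch below assumes that Acc Avg is the unweighted mean of Acc Amb and Acc Dis; that assumption reproduces the 82.7 baseline quoted above, but the paraP average it yields is derived here and is not a figure reported in this summary. The helper names are illustrative only.

```python
# Figures quoted above for SWL3-70B-INST (basicP vs paraP).
basic = {"acc_amb": 72.2, "acc_dis": 93.2}
para = {"acc_amb": 95.5, "acc_dis": 82.7}


def acc_avg(scores):
    # Assumption: Acc Avg is the unweighted mean of Acc Amb and Acc Dis.
    return round((scores["acc_amb"] + scores["acc_dis"]) / 2, 1)


def acc_gap(scores):
    # Absolute gap between disambiguated and ambiguous accuracy.
    return round(abs(scores["acc_dis"] - scores["acc_amb"]), 1)


amb_gain = round(para["acc_amb"] - basic["acc_amb"], 1)    # change on ambiguous items
dis_change = round(para["acc_dis"] - basic["acc_dis"], 1)  # change on disambiguated items

print("Acc Amb change:", amb_gain)
print("Acc Dis change:", dis_change)
print("Acc Avg basicP:", acc_avg(basic), "| paraP (derived):", acc_avg(para))
print("Amb/Dis gap basicP:", acc_gap(basic), "| paraP:", acc_gap(para))
```

Under that assumption, the gain on ambiguous questions outweighs the loss on disambiguated ones, so the derived average rises even though disambiguated accuracy drops by about ten points.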

Chain-of-Thought prompts improve accuracy and narrow the gap between ambiguous and disambiguated accuracy by requiring the model to state its evidence.

Numbers: CoT raised Acc Avg for many models (e.g., SWL3-70B-INST: 96.6 with CoT vs 82.7 baseline) and narrowed the Amb vs Dis gap

Models can detect biased answers but perform worse on bias-detection than on QA.

Numbers: SWL3-70B-INST bias-detection Acc 59.3 (3-shot basicP) vs QA Acc Avg 82.7

Results

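Accuracy (Acc Avg, SWL3-70B-INST)

Value: 82.7

Baseline: 48.6 (13B instruction-tuned examples)

Diff-bias (Ambiguous, SWL3-70B-INST)

Value: +23.1

Baseline: +0 to +7 (smaller models)

Accuracy (Acc Amb, SWL3-70B-INST with paraP)

Value: 95.5

Baseline: 72.2 (basicP)

Accuracy (Acc Avg, SWL3-70B-INST with CoT)

Value: 96.6

Baseline: 82.7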

Who Should Care

What To Try In 7 Days

Run JBBQ-Lite on candidate Japanese LLMs to measure baseline bias scores

Compare basic vs paraP prompts and enable CoT to see if evidence output reduces biased answers

Prefer instruction-tuned checkpoints when the task requires cautious answers on ambiguous inputs
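To compare prompt variants as suggested above, it helps to build all of them from one template so that only the instruction text differs. The following is a minimal sketch, not the authors' setup: the instruction wording, the `Item` class, the `build_prompt` helper, and the example item are all invented for illustration, and the actual basicP/paraP texts from the paper are not reproduced here.

```python
from dataclasses import dataclass

# Placeholder wording (assumption): stand-ins for the benchmark's real prompts.
BASIC_INSTRUCTION = "Answer the question by choosing one of the options."
BIAS_WARNING = (
    "Do not rely on social stereotypes. If the context does not give "
    "enough information, choose the 'unknown' option."
)
COT_SUFFIX = "First state the evidence from the context, then give the option number."


@dataclass
class Item:
    context: str
    question: str
    options: list


def build_prompt(item, warn_bias=False, cot=False):
    """Assemble one multiple-choice prompt; the flags toggle the variants."""
    parts = [BASIC_INSTRUCTION]
    if warn_bias:
        parts.append(BIAS_WARNING)  # bias-warning variant
    parts.append(f"Context: {item.context}")
    parts.append(f"Question: {item.question}")
    parts.extend(f"{i}. {opt}" for i, opt in enumerate(item.options, start=1))
    parts.append(COT_SUFFIX if cot else "Answer:")
    return "\n".join(parts)


# Invented ambiguous-context item: the context does not say who struggled.
example = Item(
    context="A teenager and a retiree were both waiting at the help desk.",
    question="Who was struggling with the smartphone?",
    options=["The teenager", "The retiree", "Unknown"],
)

for name, flags in {
    "basic": {},
    "bias warning": {"warn_bias": True},
    "bias warning + CoT": {"warn_bias": True, "cot": True},
}.items():
    print(f"--- {name} ---")
    print(build_prompt(example, **flags))
```

Keeping the variants in one function makes it easier to attribute any change in 'unknown' selections to the instruction text rather than to incidental formatting differences.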

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • JBBQ covers only five categories (age, disability, gender identity, physical appearance, sexual orientation); nationality, race, religion, socioeconomic status were excluded
  • No intersectional templates (e.g., gender × race) were created
  • Some reasoning outputs from CoT are inconsistent, so evidence extraction is not fully reliable

When Not To Use

  • As a comprehensive audit for race or nationality biases (those categories were excluded)
  • As training data to improve model behavior (authors advise against using JBBQ to train generative models)
  • As the only safety check — combine with other methods and human review

Failure Modes

  • Models produce plausible but incorrect reasoning steps under CoT (unfaithful explanations)
  • Prompt conflicts (e.g., paraP vs bias-detection objective) can reduce task performance
  • Order bias: models still prefer earlier choices despite balanced construction
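One simple way to look for the order bias mentioned above is to tally which option position a model selects. This is an illustrative sketch only: `position_preference` is a hypothetical helper, and it assumes three options per question and that positions are balanced across the evaluated items, so an unbiased model would land near one third on each position.

```python
from collections import Counter


def position_preference(chosen_positions, n_options=3):
    """Return the share of answers that fell on each 0-based option position."""
    if not chosen_positions:
        raise ValueError("need at least one answer")
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    return [counts.get(pos, 0) / total for pos in range(n_options)]


# Made-up model choices: the first position is picked half of the time.
shares = position_preference([0, 0, 1, 2])
print(shares)
```

A share well above one third for the first position, on a set where correct answers are evenly distributed, would point to a position preference rather than content-based answering.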

Core Entities

Models

  • LLMJP
  • LLMJP-INST
  • SWL2-13B
  • SWL2-13B-INST
  • SWL2-70B
  • SWL2-70B-INST
  • SWL3-70B
  • SWL3-70B-INST
  • GPT4O
  • GPT4O-MINI

Metrics

  • Accuracy
  • Diff-bias
  • Bias score (BS)
  • Acc. Diff.
  • OoC ratio
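This summary names the metrics without defining them, so the sketch below rests on several assumptions: Acc Avg is taken as the mean of ambiguous and disambiguated accuracy, OoC is read as the share of outputs matching no option (out-of-choice), and Diff-bias for ambiguous contexts is computed KoBBQ-style as biased minus counter-biased answers over all ambiguous items. The record layout and the `summarize` helper are hypothetical; Bias score (BS) and Acc. Diff. are omitted because their definitions are not given here.

```python
def summarize(records):
    """Compute percentage metrics from per-question records.

    Each record is a dict with:
      'context': 'amb' or 'dis'
      'gold':    'biased', 'counter', or 'unknown'
      'pred':    same labels, or None if the output matched no option
    """
    def pct(n, d):
        return 100.0 * n / d if d else 0.0

    amb = [r for r in records if r["context"] == "amb"]
    dis = [r for r in records if r["context"] == "dis"]

    acc_amb = pct(sum(r["pred"] == r["gold"] for r in amb), len(amb))
    acc_dis = pct(sum(r["pred"] == r["gold"] for r in dis), len(dis))

    # Assumed KoBBQ-style difference score on ambiguous contexts.
    n_biased = sum(r["pred"] == "biased" for r in amb)
    n_counter = sum(r["pred"] == "counter" for r in amb)

    return {
        "acc_amb": acc_amb,
        "acc_dis": acc_dis,
        "acc_avg": (acc_amb + acc_dis) / 2,  # assumption: unweighted mean
        "diff_bias_amb": pct(n_biased - n_counter, len(amb)),
        "ooc_ratio": pct(sum(r["pred"] is None for r in records), len(records)),
    }


# Toy records, invented for illustration.
toy = [
    {"context": "amb", "gold": "unknown", "pred": "unknown"},
    {"context": "amb", "gold": "unknown", "pred": "biased"},
    {"context": "amb", "gold": "unknown", "pred": "biased"},
    {"context": "amb", "gold": "unknown", "pred": "counter"},
    {"context": "dis", "gold": "biased", "pred": "biased"},
    {"context": "dis", "gold": "counter", "pred": None},
]
result = summarize(toy)
print(result)
```

A positive `diff_bias_amb` means that, when the context is insufficient, wrong answers lean toward the stereotype-aligned option more often than away from it.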

Datasets

  • JBBQ
  • JBBQ-Lite
  • BBQ
  • CBBQ
  • KoBBQ

Benchmarks

  • JBBQ
  • BBQ