Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
Japanese LLMs become more accurate as they scale, but larger models also show stronger harmful stereotyping. Before deploying in user-facing systems, measure bias with JBBQ and apply instruction tuning, bias-aware prompts, or chain-of-thought (CoT) prompting.
Summary TLDR
The authors build JBBQ, a Japanese multiple-choice QA dataset adapted from the English BBQ to measure social stereotyping in Japanese LLMs. JBBQ covers five culturally adjusted categories (age, disability, gender identity, physical appearance, sexual orientation), with 245 templates and 50,856 question pairs (plus a 912-sample Lite set). Baseline tests on open Japanese models and GPT-4o variants show larger models have higher QA accuracy but also stronger bias signals. Instruction tuning, a prompt that warns about bias (paraP), and chain-of-thought prompting reduce biased outputs and increase selection of 'unknown' when context is insufficient. The dataset and evaluation scripts are intended
Problem Statement
Most bias benchmarks target English and reflect US cultural contexts. Japanese LLMs lack a culturally adapted QA benchmark to measure social stereotyping. Without a localized test, models may appear safe while still reflecting harmful Japanese stereotypes.
Main Contribution
JBBQ: a Japanese multiple-choice QA benchmark for social bias, adapted from BBQ and culturally adjusted
Dataset release: 245 templates, 50,856 question pairs; JBBQ-Lite: 912 pairs for quick evaluation
Baseline analysis across open Japanese LLMs and GPT-4o variants showing trade-offs between accuracy and bias
Empirical tests showing instruction tuning, a bias-warning prompt (paraP), and Chain-of-Thought (CoT) reduce biased outputs and increase 'unknown' answers
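As a rough illustration of what one item in a BBQ-style multiple-choice benchmark looks like, the sketch below models a single question as a Python dataclass. The class name, field names, and example text are assumptions made for illustration; they are not the released JBBQ schema. The one rule it encodes comes from the benchmark design described above: when the context is ambiguous, the correct answer is the 'unknown' option.

```python
from dataclasses import dataclass
from typing import Literal, Tuple


@dataclass(frozen=True)
class BiasQAItem:
    """One multiple-choice item (hypothetical schema, not the JBBQ file format)."""
    category: str                                   # e.g. "age", "disability"
    context_type: Literal["ambiguous", "disambiguated"]
    context: str
    question: str
    choices: Tuple[str, str, str]                   # two groups plus an 'unknown' option
    label: int                                      # index of the correct choice
    unknown_index: int                              # index of the 'unknown' choice
    stereotyped_index: int                          # index of the stereotype-consistent choice

    def is_well_formed(self) -> bool:
        n = len(self.choices)
        indices = (self.label, self.unknown_index, self.stereotyped_index)
        in_range = all(0 <= i < n for i in indices)
        distinct = self.unknown_index != self.stereotyped_index
        # An ambiguous context cannot settle the question,
        # so the correct answer must be the 'unknown' option.
        ambiguous_ok = (
            self.context_type != "ambiguous" or self.label == self.unknown_index
        )
        return in_range and distinct and ambiguous_ok


# Made-up example text, used only to show the shape of an item.
example = BiasQAItem(
    category="age",
    context_type="ambiguous",
    context="A grandson and his grandfather were trying to book a ride on a phone app.",
    question="Who was not comfortable using the phone?",
    choices=("The grandfather", "The grandson", "Unknown"),
    label=2,
    unknown_index=2,
    stereotyped_index=0,
)
print(example.is_well_formed())
```

Keeping the stereotype-consistent choice explicit in each item makes it possible to score bias separately from plain accuracy.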
Key Findings
Larger models show both higher QA accuracy and stronger bias signals.
Instruction tuning helps models pick 'unknown' for ambiguous questions.
Prompting with a bias warning (paraP) raises correct 'unknown' selections but can hurt disambiguated accuracy.
Chain-of-Thought prompts improve accuracy and narrow the bias gap by requiring the model to state its evidence.
Models can detect biased answers, but they score lower on bias detection than on QA.
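The findings above separate two things: accuracy, and how often a model picks 'unknown'. A minimal sketch of how both could be tallied per context type is shown below. The record keys (`context_type`, `label`, `prediction`, `unknown_index`) and the toy data are assumptions for illustration, not the format of the authors' evaluation scripts.

```python
from collections import defaultdict


def accuracy_by_context(records):
    """Return accuracy and 'unknown' selection rate per context type.

    Each record is a dict with the hypothetical keys
    context_type, label, prediction, unknown_index.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    unknown = defaultdict(int)
    for r in records:
        ct = r["context_type"]
        totals[ct] += 1
        correct[ct] += int(r["prediction"] == r["label"])
        unknown[ct] += int(r["prediction"] == r["unknown_index"])
    return {
        ct: {
            "accuracy": correct[ct] / totals[ct],
            "unknown_rate": unknown[ct] / totals[ct],
        }
        for ct in totals
    }


# Toy predictions: in ambiguous contexts the correct label is 'unknown' (index 2).
toy = [
    {"context_type": "ambiguous", "label": 2, "unknown_index": 2, "prediction": p}
    for p in (2, 2, 0, 1)
] + [
    {"context_type": "disambiguated", "label": l, "unknown_index": 2, "prediction": p}
    for l, p in ((0, 0), (1, 1), (0, 2), (1, 1))
]

report = accuracy_by_context(toy)
print(report)
```

Reporting the two context types separately matters because a model that always answers 'unknown' would look perfect on ambiguous items while failing disambiguated ones.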
Results
Metrics reported: Accuracy, Diff-bias (Ambiguous)
Who Should Care
What To Try In 7 Days
Run JBBQ-Lite on candidate Japanese LLMs to measure baseline bias scores
Compare basic vs paraP prompts and enable CoT to see if evidence output reduces biased answers
Prefer instruction-tuned checkpoints when the task requires cautious answers on ambiguous inputs
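The prompt comparison suggested above could be set up along the following lines. This is a sketch under stated assumptions: the warning and CoT wording are placeholders (they are not the paper's paraP or CoT prompts), and `model` is any hypothetical callable that maps a prompt string to a choice index.

```python
# Placeholder wording, NOT the paper's paraP text.
BIAS_WARNING = (
    "Note: do not rely on social stereotypes. If the context does not "
    "contain enough information, choose the 'unknown' option."
)
# Placeholder CoT instruction, also an assumption.
COT_INSTRUCTION = "First quote the evidence from the context, then give the answer number."


def build_prompt(context, question, choices, variant="basic"):
    """Assemble a prompt for one item in one of three variants."""
    if variant not in {"basic", "warning", "cot"}:
        raise ValueError(f"unknown variant: {variant}")
    lines = []
    if variant == "warning":
        lines.append(BIAS_WARNING)
    lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    lines.extend(f"{i}. {c}" for i, c in enumerate(choices))
    if variant == "cot":
        lines.append(COT_INSTRUCTION)
    else:
        lines.append("Answer with the number of one choice.")
    return "\n".join(lines)


def compare_variants(items, model, variants=("basic", "warning", "cot")):
    """Run every item through every prompt variant; returns {variant: [choice index, ...]}."""
    return {
        v: [model(build_prompt(it["context"], it["question"], it["choices"], v)) for it in items]
        for v in variants
    }


# Stub standing in for a real LLM call: it answers 'unknown' only when warned.
def stub_model(prompt):
    return 2 if prompt.startswith("Note:") else 0


items = [{
    "context": "Two people were waiting at the clinic.",
    "question": "Who forgot the appointment time?",
    "choices": ["Person A", "Person B", "Unknown"],
}]
results = compare_variants(items, stub_model)
print(results)
```

Swapping `stub_model` for a real model client keeps the comparison logic unchanged, so per-variant outputs can be scored with the same metrics.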
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- JBBQ covers only five categories (age, disability, gender identity, physical appearance, sexual orientation); nationality, race, religion, and socioeconomic status were excluded
- No intersectional templates (e.g., gender × race) were created
- Some reasoning outputs from CoT are inconsistent, so evidence extraction is not fully reliable
When Not To Use
- As a comprehensive audit for race or nationality biases (those categories were excluded)
- As training data to improve model behavior (authors advise against using JBBQ to train generative models)
- As the only safety check; combine it with other methods and human review
Failure Modes
- Models produce plausible but incorrect reasoning steps under CoT (unfaithful explanations)
- Prompt conflicts (e.g., paraP vs bias-detection objective) can reduce task performance
- Order bias: models still prefer earlier choices despite balanced construction
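One way to probe the order-bias failure mode is sketched below: tally which answer position a model picks, and check whether its chosen answer stays the same when the choices are permuted. The function names and the two stub "models" are hypothetical; this is a generic diagnostic, not the authors' analysis.

```python
from collections import Counter
from itertools import permutations


def position_distribution(predicted_positions, n_choices=3):
    """Fraction of predictions landing on each answer position."""
    if not predicted_positions:
        raise ValueError("no predictions given")
    counts = Counter(predicted_positions)
    total = len(predicted_positions)
    return [counts[i] / total for i in range(n_choices)]


def is_order_invariant(choices, answer_fn):
    """True if answer_fn picks the same choice text under every ordering."""
    picked = {perm[answer_fn(list(perm))] for perm in permutations(choices)}
    return len(picked) == 1


# Stub answerers standing in for real model calls.
def first_position_model(choices):
    return 0                          # always picks whatever is listed first


def content_model(choices):
    return choices.index("Unknown")   # picks by content, not position


dist = position_distribution([0, 0, 0, 0, 1, 1, 2, 2])
print(dist)
print(is_order_invariant(["A", "B", "Unknown"], first_position_model))
print(is_order_invariant(["A", "B", "Unknown"], content_model))
```

With a balanced dataset, a position distribution that leans heavily toward the first slot points to order bias rather than a property of the items.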
Core Entities
Models
- LLMJP
- LLMJP-INST
- SWL2-13B
- SWL2-13B-INST
- SWL2-70B
- SWL2-70B-INST
- SWL3-70B
- SWL3-70B-INST
- GPT4O
- GPT4O-MINI
Metrics
- Accuracy
- Diff-bias
- Bias score (BS)
- Acc. Diff.
- OoC ratio
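For orientation, one common formulation of a diff-bias score on ambiguous items, the one used in KoBBQ, is the number of stereotype-consistent answers minus the number of counter-stereotype answers, divided by the number of ambiguous items. Whether JBBQ computes Diff-bias in exactly this way is an assumption here; the sketch below only illustrates that formulation, and the answer labels it uses are hypothetical.

```python
def diff_bias_ambiguous(answers):
    """KoBBQ-style diff-bias on ambiguous items (assumed formulation).

    answers: list of 'biased', 'counter_biased', or 'unknown', one per item.
    Returns a value in [-1, 1]; 0 means no net lean toward the stereotype.
    """
    allowed = {"biased", "counter_biased", "unknown"}
    if not answers:
        raise ValueError("no answers given")
    if any(a not in allowed for a in answers):
        raise ValueError("answers must be 'biased', 'counter_biased', or 'unknown'")
    n_biased = answers.count("biased")
    n_counter = answers.count("counter_biased")
    # (stereotype-consistent - counter-stereotype) / all ambiguous items
    return (n_biased - n_counter) / len(answers)


toy_answers = ["biased"] * 3 + ["counter_biased"] + ["unknown"] * 4
score = diff_bias_ambiguous(toy_answers)
print(score)
```

A positive score means wrong answers lean toward the stereotype; a model that always answers 'unknown' on ambiguous items scores 0.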
Datasets
- JBBQ
- JBBQ-Lite
- BBQ
- CBBQ
- KoBBQ
Benchmarks
- JBBQ
- BBQ

