Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
If your product needs general reasoning, code, or academic skills, models trained on large-scale English data often suffice, so buy or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.
Summary TLDR
The authors evaluated 35 public LLMs on 19 Japanese and English benchmarks. They find three main ability factors: (1) a general ability that covers most tasks and scales with English compute, (2) a Japanese-specific ability tied to encyclopedic Japan knowledge and English→Japanese translation that scales with Japanese training tokens, and (3) a multilingual arithmetic/code factor. Practically: you usually don't need Japanese pretraining for code, math, or academic tasks; you do need Japanese data to get local facts and better English→Japanese translation.
Problem Statement
Teams building local (non-English) LLMs need evidence about which skills actually require target-language training, which skills transfer from English, and whether local-language datasets scale those skills. The paper answers this by evaluating 35 LLMs across 19 paired Japanese/English benchmarks and analyzing score correlations and principal components.
Main Contribution
Unified evaluation of 35 public English, multilingual, and Japanese LLMs on 19 Japanese and English benchmarks.
PCA-based identification of three interpretable ability factors: general ability, Japanese-specific ability, and arithmetic/code ability.
Evidence that general ability scales with English compute while Japanese-specific ability scales with Japanese token budget.
Robustness checks: similar results with models trained from scratch and with rotated factor analysis.
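The PCA step named above can be illustrated with a minimal sketch. It assumes a models × benchmarks score matrix, standardizes each benchmark column, and extracts components with an SVD. The standardization choice, the function name `pca_on_scores`, and the synthetic data are assumptions made for illustration; they are not the authors' pipeline or results.

```python
import numpy as np


def pca_on_scores(scores: np.ndarray, n_components: int = 3):
    """PCA on a (models x benchmarks) score matrix via SVD.

    Returns (component_scores, loadings, explained_variance_ratio).
    """
    # Standardize each benchmark column so differently scaled metrics
    # (accuracy, BLEU, pass@1, ...) contribute comparably. This is an assumption.
    mean = scores.mean(axis=0)
    std = scores.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    z = (scores - mean) / std

    # z = U @ diag(S) @ Vt; rows of Vt are the principal directions.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)

    component_scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    return component_scores, loadings, explained[:n_components]


# Synthetic stand-in data: 35 models, 19 benchmarks, one shared "general" factor.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 35, 19
general = rng.normal(size=(n_models, 1))
weights = rng.uniform(0.5, 1.0, size=(1, n_benchmarks))
scores = general @ weights + 0.3 * rng.normal(size=(n_models, n_benchmarks))

pc_scores, loadings, ratio = pca_on_scores(scores)
print(pc_scores.shape, loadings.shape, ratio)
```

Each row of `loadings` shows how strongly every benchmark contributes to a component, which is what makes a component interpretable as an "ability factor".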
Key Findings
General (cross-task) ability correlates strongly with English compute budget.
Japanese-specific ability (local facts + English→Japanese translation) scales with Japanese training data.
Many skills transfer across languages for code, math, and academic subjects.
PCA condenses performance into a few factors explaining most variance.
The advantage of local-language training depends on the task, and this finding is robust to how the model was built (for example, trained from scratch).
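One way to check a "factor scales with budget" claim like those above is to correlate a factor score with the logarithm of the training budget. The sketch below computes a Pearson correlation on synthetic numbers; the use of a log10 scale, the variable names, and all values are assumptions for illustration and do not reproduce the paper's data.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Hypothetical inputs: one log10 budget and one factor score per model.
rng = np.random.default_rng(1)
log_budget = rng.uniform(20.0, 24.0, size=35)
factor_score = 1.5 * (log_budget - 22.0) + rng.normal(scale=0.5, size=35)

r = pearson_r(log_budget, factor_score)
print(f"Pearson r between log budget and factor score: {r:.3f}")
```

A high correlation here is still observational evidence only; it does not rule out confounders such as model family or data quality.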
Results
PC1 vs English compute
PC2 vs Japanese compute
Cross-lingual benchmark correlations
Variance explained by top PCs
Who Should Care
What To Try In 7 Days
Run your most important tasks (code, math, domain QA) on a strong English model and on a Japanese continual pre-training (CPT) model, then compare the scores.
Measure gaps on local knowledge QA and en→ja translation using NIILC and WMT20-en-ja.
Estimate your Japanese training budget (parameters × Japanese training tokens) and plan data collection if you need gains on the Japanese-specific factor (PC2).
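The steps above can be sketched in a few lines: compute the budget as parameters × Japanese tokens, and compute per-task score gaps between two models. The function names, model sizes, token counts, and scores below are hypothetical placeholders, not figures from the paper.

```python
def language_budget(n_params: float, n_tokens: float) -> float:
    """Budget proxy: parameter count times training tokens in the target language."""
    return n_params * n_tokens


def score_gaps(english_model: dict, japanese_model: dict) -> dict:
    """Per-task difference (Japanese model minus English model) on shared tasks."""
    shared = english_model.keys() & japanese_model.keys()
    return {task: japanese_model[task] - english_model[task] for task in sorted(shared)}


# Hypothetical example: an 8B-parameter model trained on 100B Japanese tokens.
budget = language_budget(8e9, 100e9)

# Hypothetical scores in [0, 1]; replace with your own measurements.
english_scores = {"code": 0.40, "math": 0.55, "local_qa": 0.30, "en_ja_translation": 0.20}
japanese_scores = {"code": 0.38, "math": 0.54, "local_qa": 0.55, "en_ja_translation": 0.27}

gaps = score_gaps(english_scores, japanese_scores)
print(f"budget = {budget:.2e}")
for task, gap in gaps.items():
    print(f"{task}: {gap:+.2f}")
```

Large positive gaps only on local-knowledge QA and en→ja translation would match the pattern the summary describes; gaps elsewhere suggest checking your evaluation setup first.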
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Safety, bias, and alignment were not evaluated and may differ with local training (§ Ethics Statement).
- Language-specific token counts were estimated for some models using heuristics (§B.3), which adds uncertainty in budget analysis.
- Benchmarks include translated versions of English tasks; translations can inflate cross-lingual correlations for some benchmarks.
- The study focuses on general-purpose tasks, not domain-specific or safety-critical evaluations.
When Not To Use
- When you need safety, fairness, or bias assessments: this study omits them.
- When your use case relies on instruction-tuned or chat-finetuned models: only base models were evaluated.
- When the target task is domain-specific (medical, legal): these tasks were not covered.
Failure Modes
- Overinterpreting correlation as causation: observational design may miss confounders.
- Misestimating language budgets when token counts are approximate.
- Relying on BLEU/ROUGE for nuanced quality; n-gram metrics showed low variance and blind spots.
Core Entities
Models
- Llama 3
- Llama 2
- Mistral
- Qwen2
- LLM-jp
- Llama 3 Swallow
- Sarashina2
- CyberAgentLM2
Metrics
- Accuracy
- Exact Match (EM)
- pass@1
- BLEU
- ROUGE-2
- Char F1
Datasets
- MMLU/JMMLU
- GSM8K/MGSM
- HumanEval/JHumanEval
- JEMHopQA
- NIILC
- WMT20 en-ja / ja-en
- JSQuAD/SQuAD2
- XL-Sum
- OpenBookQA
- HellaSwag
- TriviaQA
Benchmarks
- MMLU
- JMMLU
- GSM8K
- MGSM
- HumanEval
- JHumanEval
- JEMHopQA
- NIILC
- WMT20-en-ja
- WMT20-ja-en

