Overview
The analysis covers 35 publicly available models and 19 benchmarks; score correlations and PCA indicate which abilities depend on local-language (Japanese) training data and which transfer from English.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
If your product needs general reasoning, code, or academic skills, models trained at scale on English data often suffice, so buy or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.
Summary TLDR
The authors evaluated 35 public LLMs on 19 Japanese and English benchmarks. They find three main ability factors: (1) a general ability that covers most tasks and scales with English compute, (2) a Japanese-specific ability tied to encyclopedic Japan knowledge and English→Japanese translation that scales with Japanese training tokens, and (3) a multilingual arithmetic/code factor. Practically: you usually don't need Japanese pretraining for code, math, or academic tasks; you do need Japanese data to get local facts and better English→Japanese translation.
Problem Statement
Teams building local (non-English) LLMs need evidence about which skills actually require target-language training, which skills transfer from English, and whether local-language datasets scale those skills. The paper answers this by evaluating 35 LLMs across 19 paired Japanese/English benchmarks and analyzing score correlations and principal components.
Main Contribution
Unified evaluation of 35 public English, multilingual, and Japanese LLMs on 19 Japanese and English benchmarks.
PCA-based identification of three interpretable ability factors: general ability, Japanese-specific ability, and arithmetic/code ability.
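The PCA step can be sketched as follows. This is a minimal illustration, not the authors' code: the score matrix is synthetic stand-in data, the per-benchmark standardization is an assumption, and the names `pca_scores`, `w_general`, and `w_japanese` are hypothetical.

```python
import numpy as np

def pca_scores(scores, n_components=3):
    """PCA of a (models x benchmarks) score matrix via SVD.

    Returns per-model component scores, per-benchmark loadings,
    and the explained-variance ratio of each kept component.
    """
    X = np.asarray(scores, dtype=float)
    # Assumption: standardize each benchmark column so no single scale dominates.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes; singular values give the variances.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    components = U[:, :n_components] * S[:n_components]
    return components, Vt[:n_components], explained[:n_components]

# Synthetic stand-in for 35 models x 19 benchmarks (NOT the paper's data):
# one factor loads on every benchmark, a second only on the first four.
rng = np.random.default_rng(0)
general = rng.normal(size=(35, 1))
japanese = rng.normal(size=(35, 1))
w_general = np.ones((1, 19))
w_japanese = np.zeros((1, 19))
w_japanese[0, :4] = 1.0
scores = general @ w_general + japanese @ w_japanese + 0.1 * rng.normal(size=(35, 19))

pcs, loadings, ratio = pca_scores(scores)
print("explained variance ratio:", np.round(ratio, 3))
```

Inspecting which benchmarks carry large loadings on each component is what makes a factor interpretable; note that the sign of each component is arbitrary.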
Key Findings
General (cross-task) ability correlates strongly with English compute budget.
Japanese-specific ability (local facts + English→Japanese translation) scales with Japanese training data.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PC1 vs English compute | Pearson ρ = 0.916 | — | — | Models with compute budgets (n=27 of 35) | PC1 (general ability) scales with English ND | Figure 6; §4.4 |
| PC2 vs Japanese compute | Pearson ρ = 0.779 | — | — | Models with estimated JA tokens (n=25) | PC2 (Japanese ability) scales with Japanese ND | Figure 7; §4.4 |
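A correlation of this kind can be computed as in the sketch below. The budget and PC1 values are made up for illustration, and taking the logarithm of the budget (ND, parameters × training tokens) before correlating is an assumption, not something the table states.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical values, NOT taken from the paper.
budgets = np.array([1e21, 5e21, 2e22, 1e23, 6e23])  # ND = params x tokens
pc1 = np.array([-2.1, -0.9, 0.1, 1.2, 1.7])         # PC1 score per model

# Assumption: budgets span orders of magnitude, so correlate on a log scale.
r = pearson(np.log10(budgets), pc1)
print(f"Pearson correlation: {r:.3f}")
```

With only a few dozen models, a coefficient like this describes an association across the evaluated set; it does not by itself show that adding compute causes the gain.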
What To Try In 7 Days
Run your important tasks (code, math, domain QA) on a strong English model and a Japanese CPT model; compare.
Measure gaps on local knowledge QA and en→ja translation using NIILC and WMT20-en-ja.
Estimate your Japanese compute budget (parameters × Japanese training tokens) and plan data collection if you need gains in PC2 (Japanese-specific ability).
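The budget estimate in the last step is simple arithmetic; a rough sketch follows. The model names, parameter counts, and token counts are hypothetical placeholders to be replaced with your own candidates.

```python
def language_budget(n_params, n_tokens):
    """Budget for one language: parameter count times training tokens in that language."""
    return n_params * n_tokens

# Hypothetical candidates: (parameters, Japanese training tokens).
candidates = {
    "hypothetical-7b-english-only": (7e9, 1e9),
    "hypothetical-7b-ja-cpt": (7e9, 100e9),
    "hypothetical-13b-ja-cpt": (13e9, 50e9),
}

ja_budgets = {name: language_budget(n, d) for name, (n, d) in candidates.items()}

# Rank candidates from largest to smallest Japanese budget.
ranked = sorted(ja_budgets, key=ja_budgets.get, reverse=True)
for name in ranked:
    print(f"{name}: {ja_budgets[name]:.2e}")
```

Comparing these budgets against your measured gaps on local-knowledge QA and en→ja translation gives a first indication of how much Japanese data collection might be worthwhile.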
Risks & Boundaries
Limitations
Safety, bias, and alignment were not evaluated and may differ with local training (§ Ethics Statement).
Language-specific token counts were estimated for some models using heuristics (§B.3), which adds uncertainty in budget analysis.
When Not To Use
When you need safety, fairness, or bias assessments, which this study omits.
When your use case relies on instruction-tuned or chat-finetuned models, since only base models were evaluated.
Failure Modes
Overinterpreting correlation as causation: observational design may miss confounders.
Misestimating language budgets when token counts are approximate.

