Local training helps local knowledge and translation; many reasoning and code skills transfer from English

December 19, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki

Why It Matters For Business

If your product needs general reasoning, code, or academic skills, models trained mostly on English at scale often suffice, so license such a model or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.

Summary TLDR

The authors evaluated 35 public LLMs on 19 Japanese and English benchmarks. They find three main ability factors: (1) a general ability that covers most tasks and scales with English compute, (2) a Japanese-specific ability tied to encyclopedic Japan knowledge and English→Japanese translation that scales with Japanese training tokens, and (3) a multilingual arithmetic/code factor. Practically: you usually don't need Japanese pretraining for code, math, or academic tasks; you do need Japanese data to get local facts and better English→Japanese translation.

Problem Statement

Teams building local (non-English) LLMs need evidence about which skills actually require target-language training, which skills transfer from English, and whether local-language datasets scale those skills. The paper answers this by evaluating 35 LLMs across 19 paired Japanese/English benchmarks and analyzing score correlations and principal components.

Main Contribution

Unified evaluation of 35 public English, multilingual, and Japanese LLMs on 19 Japanese and English benchmarks.

PCA-based identification of three interpretable ability factors: general ability, Japanese-specific ability, and arithmetic/code ability.

Evidence that general ability scales with English compute while Japanese-specific ability scales with Japanese token budget.

Robustness checks: similar results with models trained from scratch and with rotated factor analysis.
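
The PCA step can be sketched as follows. This is a minimal illustration, not the authors' code: the score matrix is synthetic stand-in data, the per-benchmark standardization is an assumption, and the names `pca`, `general`, and `local` are hypothetical.

```python
import numpy as np

def pca(scores: np.ndarray):
    """PCA of a (models x benchmarks) score matrix via SVD.

    Returns (components, explained_ratio, projected). Rows of `components`
    are principal axes over benchmarks; `projected` holds each model's
    coordinates on PC1, PC2, ...
    """
    # Assumption: standardize each benchmark so metrics on different
    # scales (accuracy, BLEU, pass@1) contribute comparably.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained_ratio = s**2 / np.sum(s**2)
    projected = u * s  # same as z @ vt.T
    return vt, explained_ratio, projected

# Synthetic stand-in: 35 "models" x 19 "benchmarks" (NOT the paper's scores).
rng = np.random.default_rng(0)
general = rng.normal(size=(35, 1))   # hypothetical shared ability
local = rng.normal(size=(35, 1))     # hypothetical language-specific ability
loadings_general = rng.uniform(0.5, 1.0, size=(1, 19))
loadings_local = np.zeros((1, 19))
loadings_local[0, :4] = 1.0          # pretend 4 benchmarks need local data
scores = general @ loadings_general + local @ loadings_local
scores += 0.1 * rng.normal(size=scores.shape)

components, explained_ratio, projected = pca(scores)
print(np.round(explained_ratio[:4], 3))             # variance share per PC
print(np.round(np.cumsum(explained_ratio)[:4], 3))  # cumulative share
```

With real data, each row would be one model and each column one benchmark score; inspecting the rows of `components` shows which benchmarks load on each factor.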

Key Findings

General (cross-task) ability correlates strongly with English compute budget.

Numbers: Pearson ρ = 0.916 between English compute budget (ND: parameters × English training tokens) and PC1

Japanese-specific ability (local facts + English→Japanese translation) scales with Japanese training data.

Numbers: Pearson ρ = 0.779 between Japanese compute budget (ND: parameters × Japanese training tokens) and PC2

Many skills transfer across languages for code, math, and academic subjects.

Numbers: Cross-lingual correlations: HumanEval vs JHumanEval 0.98; GSM8K vs MGSM 0.94; MMLU vs JMMLU 0.91

PCA condenses performance into a few factors explaining most variance.

Numbers: PC1–PC4 cumulative variance = 90.8% (PC1 = 65.2%)

Local training advantages are task-dependent and robust across construction methods.

Numbers: Same factor structure found using only scratch-trained models and Promax rotation
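
The cross-lingual correlations above are Pearson coefficients computed over per-model scores on a paired benchmark. A minimal sketch of that calculation, using made-up scores for five hypothetical models (not values from the paper):

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation between two equal-length score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Hypothetical pass@1 scores for the same five models on an English
# benchmark and its Japanese counterpart (illustrative numbers only).
english = [0.20, 0.35, 0.48, 0.55, 0.62]
japanese = [0.15, 0.30, 0.45, 0.50, 0.60]

print(round(pearson(english, japanese), 3))
```

A value near 1 means models rank almost identically in both languages, which is the pattern reported for code, math, and academic benchmarks. The same function applies to a compute budget versus a principal-component score.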

Results

PC1 vs English compute

Value: Pearson ρ = 0.916

PC2 vs Japanese compute

Value: Pearson ρ = 0.779

Cross-lingual benchmark correlations

Value: HumanEval vs JHumanEval ρ≈0.98; GSM8K vs MGSM ρ≈0.94; MMLU vs JMMLU ρ≈0.91

Variance explained by top PCs

Value: PC1–PC4 cumulative = 90.8% (PC1 = 65.2%)

Who Should Care

What To Try In 7 Days

Run your important tasks (code, math, domain QA) on a strong English model and a Japanese CPT model; compare.

Measure gaps on local knowledge QA and en→ja translation using NIILC and WMT20-en-ja.

Estimate your Japanese training budget (parameters × Japanese tokens) and plan data collection if PC2 gains are needed.
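
The budget estimate in the last step is simple arithmetic. A minimal sketch, assuming you know (or can approximate) the share of training tokens that are Japanese; the function name `language_budget` and the example figures are hypothetical:

```python
def language_budget(params: float, total_tokens: float, language_share: float) -> float:
    """Approximate a language-specific budget as parameters x tokens in that language."""
    if not 0.0 <= language_share <= 1.0:
        raise ValueError("language_share must be in [0, 1]")
    return params * total_tokens * language_share

# Hypothetical plan: an 8B-parameter model, 100B training tokens,
# half of them Japanese (illustrative figures, not from the paper).
ja_budget = language_budget(8e9, 100e9, 0.5)
print(f"{ja_budget:.2e}")  # 4.00e+20
```

Because the language share is often a heuristic estimate, it is worth recomputing the budget for a low and a high share to see how much the conclusion moves.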

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Safety, bias, and alignment were not evaluated and may differ with local training (§ Ethics Statement).
  • Language-specific token counts were estimated for some models using heuristics (§B.3), which adds uncertainty in budget analysis.
  • Benchmarks include translated versions of English tasks; translations can inflate cross-lingual correlations for some benchmarks.
  • The study focuses on general-purpose tasks, not domain-specific or safety-critical evaluations.

When Not To Use

  • When you need safety, fairness, or bias assessments — this study omits them.
  • If your use case relies on instruction-tuned or chat-finetuned models — only base models were evaluated.
  • When the target task is domain-specific (medical, legal) — these tasks were not covered.

Failure Modes

  • Overinterpreting correlation as causation: observational design may miss confounders.
  • Misestimating language budgets when token counts are approximate.
  • Relying on BLEU/ROUGE for nuanced quality; n-gram metrics showed low variance and blind spots.
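
One way to spot the low-variance problem in your own evaluation is to check how much each metric actually spreads across models before reading anything into it. A rough sketch with made-up scores; the threshold `min_std`, the benchmark names, and the helper `low_spread_benchmarks` are all assumptions for illustration:

```python
import numpy as np

def low_spread_benchmarks(scores, names, min_std=0.02):
    """Return (name, std) for benchmarks whose scores barely vary across models."""
    scores = np.asarray(scores, dtype=float)
    stds = scores.std(axis=0)  # one standard deviation per benchmark column
    return [(n, float(s)) for n, s in zip(names, stds) if s < min_std]

# Hypothetical scores: rows are models, columns are benchmarks.
names = ["qa_accuracy", "summarization_rouge2", "translation_bleu"]
scores = [
    [0.40, 0.10, 0.20],
    [0.55, 0.11, 0.21],
    [0.70, 0.10, 0.22],
    [0.85, 0.12, 0.20],
]

flagged = low_spread_benchmarks(scores, names)
print([name for name, _ in flagged])  # ['summarization_rouge2', 'translation_bleu']
```

Benchmarks flagged this way separate models poorly, so correlations or rankings built on them deserve extra caution.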

Core Entities

Models

  • Llama 3
  • Llama 2
  • Mistral
  • Qwen2
  • LLM-jp
  • Llama 3 Swallow
  • Sarashina2
  • CyberAgentLM2

Metrics

  • Accuracy
  • Exact Match (EM)
  • pass@1
  • BLEU
  • ROUGE-2
  • Char F1

Datasets

  • MMLU/JMMLU
  • GSM8K/MGSM
  • HumanEval/JHumanEval
  • JEMHopQA
  • NIILC
  • WMT20 en-ja / ja-en
  • JSQuAD/SQuAD2
  • XL-Sum
  • OpenBookQA
  • HellaSwag
  • TriviaQA

Benchmarks

  • MMLU
  • JMMLU
  • GSM8K
  • MGSM
  • HumanEval
  • JHumanEval
  • JEMHopQA
  • NIILC
  • WMT20-en-ja
  • WMT20-ja-en