Local training helps local knowledge and translation; many reasoning and code skills transfer from English

December 19, 2024, 7 min

Overview

Decision Snapshot: Ready For Pilot

The analysis covers 35 public models and 19 benchmarks; score correlations and PCA give robust, actionable signals about which abilities need local-language data.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs general reasoning, code, or academic skills, English-scale models often suffice; buy or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.

Who Should Care

Summary TLDR

The authors evaluated 35 public LLMs on 19 Japanese and English benchmarks. They find three main ability factors: (1) a general ability that covers most tasks and scales with English compute, (2) a Japanese-specific ability tied to encyclopedic Japan knowledge and English→Japanese translation that scales with Japanese training tokens, and (3) a multilingual arithmetic/code factor. Practically: you usually don't need Japanese pretraining for code, math, or academic tasks; you do need Japanese data to get local facts and better English→Japanese translation.

Problem Statement

Teams building local (non-English) LLMs need evidence about which skills actually require target-language training, which skills transfer from English, and whether more local-language data scales those skills. The paper answers this by evaluating 35 LLMs on 19 Japanese and English benchmarks and analyzing score correlations and principal components.

Main Contribution

Unified evaluation of 35 public English, multilingual, and Japanese LLMs on 19 Japanese and English benchmarks.

PCA-based identification of three interpretable ability factors: general ability, Japanese-specific ability, and arithmetic/code ability.
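As a rough illustration of the PCA step, the sketch below standardizes a model-by-benchmark score matrix and extracts its first principal component. It is not the authors' code: the scores in `TOY_SCORES` are made up, the z-score standardization is an assumption about preprocessing, and power iteration stands in for a full eigendecomposition so the example stays dependency-free. It recovers only the first component; the paper reports three.

```python
import math

# Hypothetical scores: rows are models, columns are benchmarks (not real results).
TOY_SCORES = [
    [0.30, 0.10, 0.12, 0.20],
    [0.45, 0.25, 0.20, 0.35],
    [0.55, 0.40, 0.33, 0.30],
    [0.65, 0.55, 0.45, 0.50],
    [0.75, 0.70, 0.60, 0.55],
]

def zscore_columns(scores):
    """Standardize each benchmark column to zero mean and unit variance."""
    n_rows, n_cols = len(scores), len(scores[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [row[j] for row in scores]
        mean = sum(col) / n_rows
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n_rows)
        for i in range(n_rows):
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out

def first_principal_component(z, iters=500):
    """Power iteration on the covariance of z; returns (loadings, per-model scores)."""
    n_rows, n_cols = len(z), len(z[0])
    cov = [[sum(z[i][a] * z[i][b] for i in range(n_rows)) / n_rows
            for b in range(n_cols)] for a in range(n_cols)]
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each model's standardized scores onto the loading vector.
    projected = [sum(z[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return v, projected

loadings, pc1 = first_principal_component(zscore_columns(TOY_SCORES))
for model_index, score in enumerate(pc1):
    print(f"model {model_index}: PC1 = {score:+.2f}")
```

When every benchmark loads with the same sign on the first component, as in this toy data, that component behaves like a single "general ability" axis, which is how the summary describes PC1.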

Key Findings

General (cross-task) ability correlates strongly with English compute budget.

Numbers: Pearson ρ = 0.916 between English compute budget (ND) and PC1

Practical Use: For broad gains across many tasks, prioritize English-scale compute/data rather than heavy target-language pretraining.

Evidence Ref: Figure 6; §4.4

Japanese-specific ability (local facts + English→Japanese translation) scales with Japanese training data.

Numbers: Pearson ρ = 0.779 between Japanese compute budget (ND) and PC2

Practical Use: To improve local knowledge and English→Japanese translation, invest directly in Japanese training tokens.

Evidence Ref: Figure 7; §4.4
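The correlations above relate a per-model compute budget to a per-model component score. A minimal sketch of that calculation follows, reading ND as parameters × training tokens (consistent with the "params×JA tokens" wording later in this summary). The model entries in `TOY_MODELS` are invented, and taking the base-10 logarithm of the budget before correlating is an assumption, not something this summary confirms.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (parameters, English training tokens, PC1 score) per model.
TOY_MODELS = [
    (1.3e9, 0.3e12, -2.1),
    (7e9, 1.0e12, -0.6),
    (7e9, 2.0e12, 0.1),
    (13e9, 2.0e12, 0.7),
    (70e9, 2.0e12, 1.9),
]

# Compute budget ND = parameters * tokens, on a log scale (assumption).
log_budgets = [math.log10(n * d) for n, d, _ in TOY_MODELS]
pc1_scores = [score for _, _, score in TOY_MODELS]

rho = pearson(log_budgets, pc1_scores)
print(f"Pearson rho = {rho:.3f}")
```

The same function applies to the Japanese case by swapping in Japanese token counts and PC2 scores. As the Failure Modes section notes, a high ρ on observational data does not by itself establish causation.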

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
PC1 vs English compute | Pearson ρ = 0.916 | - | - | All evaluated models (n=27 with budgets) | PC1 (general ability) scales with English ND | Figure 6; §4.4
PC2 vs Japanese compute | Pearson ρ = 0.779 | - | - | Models with estimated JA tokens (n=25) | PC2 (Japanese ability) scales with Japanese ND | Figure 7; §4.4

What To Try In 7 Days

Run your most important tasks (code, math, domain QA) on a strong English model and on a Japanese continually pre-trained (CPT) model, then compare the scores.

Measure gaps on local knowledge QA and en→ja translation using NIILC and WMT20-en-ja.

Estimate your Japanese compute budget (parameters × Japanese training tokens) and plan data collection if PC2 (Japanese-specific ability) gains are needed.
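The budget estimate in the last step is simple arithmetic. A small sketch follows, assuming the budget is the product of parameter count and Japanese training tokens. The function name `japanese_budget`, the `ja_fraction` argument, and all figures are hypothetical placeholders; they are not the heuristics the paper uses in §B.3.

```python
import math

def japanese_budget(n_params, total_tokens, ja_fraction):
    """Return parameters * estimated Japanese tokens.

    ja_fraction is the assumed share of Japanese text in the training
    corpus; it is a rough estimate, so the result is approximate too.
    """
    if not 0.0 <= ja_fraction <= 1.0:
        raise ValueError("ja_fraction must be between 0 and 1")
    ja_tokens = total_tokens * ja_fraction
    return n_params * ja_tokens

# Hypothetical plan: an 8B-parameter model trained on 100B tokens,
# half of which are assumed to be Japanese.
budget = japanese_budget(n_params=8e9, total_tokens=100e9, ja_fraction=0.5)
print(f"Japanese budget ND = {budget:.2e} (log10 = {math.log10(budget):.2f})")
```

Because the Japanese share is usually an estimate, it helps to compute the budget for a low and a high value of `ja_fraction` and treat the result as a range.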

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Safety, bias, and alignment were not evaluated and may differ with local training (§ Ethics Statement).

Language-specific token counts were estimated for some models using heuristics (§B.3), which adds uncertainty in budget analysis.

When Not To Use

When you need safety, fairness, or bias assessments: this study omits them.

When your use case relies on instruction-tuned or chat-finetuned models: only base models were evaluated.

Failure Modes

Overinterpreting correlation as causation: observational design may miss confounders.

Misestimating language budgets when token counts are approximate.

Core Entities

Models

Llama 3, Llama 2, Mistral, Qwen2, LLM-jp, Llama 3 Swallow, Sarashina2, CyberAgentLM2

Metrics

Accuracy, Exact Match (EM), pass@1, BLEU, ROUGE-2, Char F1

Datasets

MMLU/JMMLU, GSM8K/MGSM, HumanEval/JHumanEval, JEMHopQA, NIILC, WMT20 en-ja / ja-en, JSQuAD/SQuAD2, XL-Sum, OpenBookQA, HellaSwag, TriviaQA

Benchmarks

MMLU, JMMLU, GSM8K, MGSM, HumanEval, JHumanEval, JEMHopQA, NIILC, WMT20-en-ja, WMT20-ja-en