Local training helps local knowledge and translation; many reasoning and code skills transfer from English

Overview

Decision Snapshot: Ready For Pilot

The analysis covers 35 publicly available models and 19 benchmarks; score correlations and PCA indicate which abilities need local-language data.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs general reasoning, code, or academic skills, models trained mainly on large-scale English data often suffice, so buy or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The authors evaluated 35 public LLMs on 19 Japanese and English benchmarks. They find three main ability factors: (1) a general ability that covers most tasks and scales with English compute, (2) a Japanese-specific ability tied to encyclopedic Japan knowledge and English→Japanese translation that scales with Japanese training tokens, and (3) a multilingual arithmetic/code factor. Practically: you usually don't need Japanese pretraining for code, math, or academic tasks; you do need Japanese data to get local facts and better English→Japanese translation.

Problem Statement

Teams building local (non-English) LLMs need evidence about which skills actually require target-language training, which skills transfer from English, and whether local-language datasets scale those skills. The paper answers this by evaluating 35 LLMs across 19 Japanese and English benchmarks and analyzing score correlations and principal components.

Main Contribution

Unified evaluation of 35 public English, multilingual, and Japanese LLMs on 19 Japanese and English benchmarks.

PCA-based identification of three interpretable ability factors: general ability, Japanese-specific ability, and arithmetic/code ability.
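The PCA step can be illustrated with a minimal sketch. It assumes a models × benchmarks score matrix, standardizes each benchmark column, and extracts components with an SVD. The function name `pca_scores`, the standardization choice, and the toy numbers are assumptions for illustration only; they are not taken from the paper or its released code.

```python
import numpy as np

def pca_scores(scores, n_components=3):
    """Project models onto the principal components of a (models x benchmarks) matrix."""
    X = np.asarray(scores, dtype=float)
    # Standardize each benchmark column so differently scaled metrics are comparable
    # (an assumption of this sketch, not a confirmed detail of the paper).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix: rows of Vt are the principal axes (loadings).
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    projected = X @ Vt[:n_components].T
    return projected, Vt[:n_components], explained[:n_components]

# Hypothetical toy data: 5 models x 4 benchmarks (not values from the paper).
toy_scores = [
    [0.62, 0.55, 0.30, 0.21],
    [0.70, 0.61, 0.35, 0.40],
    [0.48, 0.40, 0.22, 0.45],
    [0.81, 0.74, 0.52, 0.33],
    [0.55, 0.50, 0.28, 0.58],
]
projected, loadings, explained = pca_scores(toy_scores, n_components=2)
print("explained variance ratio:", np.round(explained, 3))
print("benchmark loadings per component:")
print(np.round(loadings, 3))
```

Reading the loadings row by row shows which benchmarks move together. A component with same-sign weight on most benchmarks would correspond to a general-ability factor, while one concentrated on a few benchmarks would suggest a more specific skill.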

Key Findings

General (cross-task) ability correlates strongly with English compute budget.

Numbers: Pearson ρ = 0.916 between English ND (compute budget: parameters × English training tokens) and PC1 (general ability)

Practical Use: For broad gains across many tasks, prioritize English-scale compute/data rather than heavy target-language pretraining.

Evidence Ref: Figure 6; §4.4

Japanese-specific ability (local facts + English→Japanese translation) scales with Japanese training data.

Numbers: Pearson ρ = 0.779 between Japanese ND (compute budget: parameters × Japanese training tokens) and PC2 (Japanese-specific ability)

Practical Use: To improve local knowledge and English→Japanese translation, invest directly in Japanese training tokens.

Evidence Ref: Figure 7; §4.4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PC1 vs English compute	Pearson ρ = 0.916	n/a	n/a	Evaluated models with known budgets (n=27)	PC1 (general ability) scales with English ND	Figure 6; §4.4
PC2 vs Japanese compute	Pearson ρ = 0.779	n/a	n/a	Models with estimated Japanese token counts (n=25)	PC2 (Japanese ability) scales with Japanese ND	Figure 7; §4.4
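The correlations in the table relate a per-model compute budget to a per-model principal-component score. The sketch below shows one way to compute such a coefficient, assuming the budget is parameters × training tokens and that it is compared on a log10 scale. The log transform, the `pearson` helper, and all numbers in `toy_models` are assumptions for illustration, not values or code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical models: (parameters N, English training tokens D, PC1 score).
toy_models = [
    (7e9, 2.0e12, -1.2),
    (13e9, 2.0e12, -0.4),
    (8e9, 15.0e12, 1.1),
    (70e9, 2.0e12, 0.9),
    (70e9, 15.0e12, 2.6),
]
# Budget N x D on a log10 scale (the log transform is an assumption of this sketch).
log_budgets = [math.log10(n * d) for n, d, _ in toy_models]
pc1_scores = [pc1 for _, _, pc1 in toy_models]

r = pearson(log_budgets, pc1_scores)
print(f"Pearson r between log10(ND) and PC1: {r:.3f}")
```

With only a few dozen models, a coefficient like this is sensitive to individual outliers, so it is worth plotting the points as well as reporting the number.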

What To Try In 7 Days

Run your most important tasks (code, math, domain QA) on a strong English model and on a Japanese continually pretrained (CPT) model, then compare the scores.

Measure gaps on local knowledge QA and en→ja translation using NIILC and WMT20-en-ja.

Estimate the Japanese compute budget (parameters × Japanese training tokens) and plan data collection if you need gains on the Japanese-specific factor (PC2).
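The comparison and budget steps above can be sketched as a small script. The task names, scores, model sizes, and the helpers `score_gaps` and `japanese_budget` are hypothetical placeholders; substitute your own measurements from the benchmarks you run.

```python
def score_gaps(baseline, candidate):
    """Per-task score difference (candidate - baseline) for tasks scored by both models."""
    shared = sorted(set(baseline) & set(candidate))
    return {task: candidate[task] - baseline[task] for task in shared}

def japanese_budget(n_params, ja_tokens):
    """Japanese compute budget, taken here as parameters x Japanese training tokens."""
    return n_params * ja_tokens

# Hypothetical scores on a 0-1 scale; replace with your own measurements.
english_model = {"code": 0.41, "math": 0.52, "local_qa": 0.30, "en_ja_translation": 0.18}
japanese_cpt_model = {"code": 0.39, "math": 0.50, "local_qa": 0.55, "en_ja_translation": 0.27}

gaps = score_gaps(english_model, japanese_cpt_model)
# Largest gains for the Japanese model first.
for task, delta in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:20s} {delta:+.2f}")

# Hypothetical plan: an 8B-parameter model trained on 100B Japanese tokens.
print(f"Japanese budget: {japanese_budget(8e9, 100e9):.2e}")
```

If the large positive gaps are confined to local-knowledge QA and en→ja translation, that matches the pattern reported here and suggests Japanese data collection should target those abilities.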

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/swallow-llm/swallow-evaluation
https://zenodo.org/records/10256836

Data URLs

https://doi.org/10.5281/zenodo.13160661

Risks & Boundaries

Limitations

Safety, bias, and alignment were not evaluated and may differ with local training (§ Ethics Statement).

Language-specific token counts were estimated for some models using heuristics (§B.3), which adds uncertainty in budget analysis.

When Not To Use

When you need safety, fairness, or bias assessments: this study omits them.

If your use case relies on instruction-tuned or chat-finetuned models: only base models were evaluated.

Failure Modes

Overinterpreting correlation as causation: observational design may miss confounders.

Misestimating language budgets when token counts are approximate.

Core Entities

Models

Llama 3, Llama 2, Mistral, Qwen2, LLM-jp, Llama 3 Swallow, Sarashina2, CyberAgentLM2

Metrics

Accuracy, Exact Match (EM), pass@1, BLEU, ROUGE-2, Char F1

Datasets

MMLU/JMMLU, GSM8K/MGSM, HumanEval/JHumanEval, JEMHopQA, NIILC, WMT20 en-ja / ja-en, JSQuAD/SQuAD2, XL-Sum, OpenBookQA, HellaSwag, TriviaQA

Benchmarks

MMLU, JMMLU, GSM8K, MGSM, HumanEval, JHumanEval, JEMHopQA, NIILC, WMT20-en-ja, WMT20-ja-en

