Multilingual LLMs Papers — Parsed & Scored for Practitioners

Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

0.30

0.40

0.60

51

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Key finding

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

SeaLLMs: language models tuned and tokenized for Southeast Asian languages

0.70

0.80

7

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Key finding

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio improved from 9.09 to 1.87 (SeaLLMs, Table 1)
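
To make the cost implication of those two figures concrete, the arithmetic can be sketched as below. This assumes the "token ratio" is the number of tokens needed for a text relative to a fixed reference, so that a lower ratio means proportionally fewer tokens for the same content; the function name `token_savings` is hypothetical and not taken from the paper.

```python
from typing import Tuple


def token_savings(ratio_before: float, ratio_after: float) -> Tuple[float, float]:
    """Return (fold reduction, fractional reduction) in tokens for the same text.

    Assumption: both ratios are measured against the same reference, so
    their quotient is the change in token count for identical content.
    """
    if ratio_before <= 0 or ratio_after <= 0:
        raise ValueError("ratios must be positive")
    fold = ratio_before / ratio_after
    fraction_saved = 1.0 - ratio_after / ratio_before
    return fold, fraction_saved


# Thai figures quoted in the entry above.
fold, saved = token_savings(9.09, 1.87)
print(f"{fold:.2f}x fewer tokens, {saved:.1%} reduction")
```

Under that assumption the Thai figures work out to roughly a 4.9-fold drop in tokens, which would translate directly into per-token API cost and usable context length.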

Use translation + instruction tuning to make English LLMs much better in six non‑English languages

0.60

7

You can upgrade an English LLM to handle multiple non-English languages without huge data or retraining costs by adding parallel translation tasks and translated instructions; this saves time and compute compared to building language-specific models from scratch.

Key finding

Cross-lingual instruction tuning (x-LLaMA) substantially improves non-English QA accuracy over an English-only instruction-tuned model.

Numbers: Average +27.83% answer accuracy across six languages (XQuAD & MLQA)

You can adapt LLaMA to other languages cheaply: vocab changes often unnecessary

0.60

0.50

0.80

6

You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.

Key finding

Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.

Numbers: 0.5B vs 30B tokens; LLM-Eval AVG 1.562 (LLaMA 0.5B pretrain) vs 1.244 (Chinese LLaMA) (Table 1)

TOWER: open LLaMA-2 based multilingual models tuned for translation workflows and competitive with closed LLMs

0.70

0.30

0.60

6

You can run an open 13B model that matches or beats other open models for translation and outperforms closed models on NER and post-editing in some settings, reducing vendor lock-in and inference cost while enabling customization.

Key finding

TOWERINSTRUCT-13B is the best open model for translation and is close to GPT-4 on standard benchmarks.

Numbers: FLORES-200 COMET-22: TOWERINSTRUCT-13B 88.88 vs GPT-4 89.13

MoZIP: a 3-part multilingual benchmark plus an IP-tuned 7B model to test how well LLMs handle patent and IP tasks

0.30

0.60

0.40

3

IP tasks need factual, language-specific understanding; MoZIP and MoZi show that domain-tuning helps but general LLMs still miss facts—so verify outputs for legal or IP decisions.

Key finding

Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.

Numbers: IPQuiz average: MoZi 39.4% vs BLOOMZ-7b 29.3% (+10.1 pp)

TigerBot: an openly released 7B–180B multilingual LLM family with emphasis on Chinese, low training cost, long context and practical tools

0.80

0.50

0.70

2

TigerBot gives better Chinese and competitive English performance with practical tooling (APIs, plugins, long-context, function calling) and low claimed training cost, making it useful for production chat, document QA, and device embedding.

Key finding

TigerBot improves over Llama-2 on evaluated benchmarks.

Numbers: English chat avg 69.87 vs 65.62 (+4.25 points); Chinese base avg 65.26 vs 52.27 (+12.99)

RoleEval — 6,000 bilingual multiple-choice questions testing LLMs' knowledge of 300 real and fictional characters

0.60

0.50

2

Role knowledge matters for apps that impersonate people or fictional characters; test models with RoleEval to reveal language and domain blind spots before deployment.

Key finding

RoleEval scale and scope

Numbers: 6,000 questions; 300 characters (200 global + 100 Chinese)

Airavata: an open-source Hindi instruction-tuned LLM plus datasets and evaluations

0.30

0.40

2

Airavata lowers the barrier to building Hindi language assistants by providing an open instruction-tuned model and data; use it for classification and assistant prototypes, but avoid high-stakes production without extra safety and factual checks.

Key finding

Instruction tuning substantially improves Hindi NLU on several benchmarks versus base OpenHathi.

Numbers: IndicXNLI 0-shot: OpenHathi 16.67 → Airavata 73.26 (+56.59) (Table 3)

Survey of 84 recent papers mapping models, datasets, benchmarks and gaps for Indic languages

0.60

0.45

0.65

2

Indic languages cover ~1.5–2 billion speakers; focused datasets and compact models enable local-language products with much lower compute and cost than retraining huge universal models.

Key finding

Number of papers reviewed

Numbers: 84 papers screened and summarized

Local training helps local knowledge and translation; many reasoning and code skills transfer from English

0.80

0.60

0.70

1

If your product needs general reasoning, code, or academic skills, English-scale models often suffice; buy or scale English data. If you need accurate local facts or English→Japanese translation, invest in Japanese training tokens.

Key finding

General (cross-task) ability correlates strongly with English compute budget.

Numbers: Pearson ρ = 0.916 between English ND and PC1

First open bilingual Spanish–English financial LLM, instruction data, and benchmark

0.50

0.60

0.50

1

Spanish is a large and growing financial-language market; a small, tuned bilingual model can beat generic SOTA on Spanish finance tasks, enabling better local analytics and customer support at lower compute cost.

Key finding

Authors assembled a bilingual instruction dataset for finance.

Numbers: ≈151k instruction samples from 15 datasets

EthioLLM: open multilingual LLMs and a new EthioBenchmark for five Ethiopian languages plus English

0.40

0.50

0.60

1

EthioLLM and EthioBenchmark make practical NLP for major Ethiopian languages possible with open models and data, lowering development time for local products like moderation, news categorization, and information extraction.

Key finding

EthioLLM-large achieves competitive or better results on news classification for Amharic.

Numbers: MasakhaNEWS Amharic weighted F1: EthioLLM-large 94.18 vs XLM-R 93.1

Decouple concepts from language: an MoE design that keeps strong multilingual accuracy and cuts token costs

0.60

0.70

1

SUTRA reduces non‑English inference cost while improving accuracy in many widely spoken languages, letting companies deploy one efficient model globally instead of many costly language-specific models.

Key finding

Large non-English gains on MMLU vs GPT-3.5.

Numbers: Hindi: SUTRA 68 vs GPT-3.5 39 (+29 pts)

Open-source Galician LLMs (1.3B) trained by continual pretraining on a 2.1B-word Galician corpus

0.40

0.50

0.60

1

Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets; expect modest gains for targeted tasks but plan extra cleaning and instruction tuning for production.

Key finding

Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.

Numbers: 1.3B params; corpus = 2.13B tokens (2.1B words)

xTower: an LLM that explains translation errors and suggests fixes

0.60

0.40

0.50

1

xTower turns span-level error tags into human-readable explanations and targeted corrections, improving automated editing accuracy and saving post-editing time when integrated into MT QA pipelines.

Key finding

Explanations receive higher relatedness ratings when error spans are human-annotated than when they are predicted by an automatic detector.

Numbers: Relatedness (6-point): human spans ≈ 4.3, XCOMET spans ≈ 3.2

Learned adapter pruning replaces grid search for cross-lingual LoRA merging

0.60

0.70

0

GRASP LoRA cuts tuning runs and labeled dev needs by learning a pruning rate online, lowering compute and development cost while often improving quality on low-resource language transfer.

Key finding

GRASP LoRA improves summarization metrics over best grid-search baseline on XL-Sum.

Numbers: Arabic: +0.88 BERT-F1, +1.75 BLEU-4, +2.13 ROUGE-L; Chinese: +1.62 BERT-F1, +1.73 BLEU-4, +1.45 ROUGE-L

Data mix (math, code, synthetic) plus the right base model beats scale for African-language CPT

0.70

0.50

0.70

0

You can substantially improve African-language quality and document translation through continued pretraining of a strong open base model on a curated data mix, instead of training from scratch.

Key finding

CPT data composition is the single strongest driver of gains.

Numbers: CMS recipe gave best scores on multiple tasks (e.g., Flores 66.23 at 12B).

Use bi-encoder confidence to call an LLM only on hard historical entity links

0.70

0.60

0

You can get better entity linking for noisy, multilingual historical texts without labeled data by combining a fast retriever with selective LLM calls, cutting inference cost and reducing hallucinations.

Key finding

Adaptive ensemble with LLMs improves F1 on standard historical EL benchmarks.

Numbers: 0.723 F1 on HIPE-2020 (English, MHEL-LLaMo van chain)
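
The routing idea in this entry, trusting the fast retriever when it is confident and calling an LLM only on hard cases, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: using the gap between the top two retriever scores as the confidence signal, the `margin_threshold` value, the shortlist size, and the `llm_rerank` callback are all assumptions introduced here.

```python
from typing import Callable, Sequence, Tuple


def link_entity(
    scores: Sequence[float],
    candidates: Sequence[str],
    llm_rerank: Callable[[Sequence[str]], str],
    margin_threshold: float = 0.1,  # hypothetical value, tune on held-out data
    shortlist_size: int = 5,        # hypothetical value
) -> Tuple[str, bool]:
    """Return (chosen candidate, whether the LLM was called)."""
    if not candidates or len(scores) != len(candidates):
        raise ValueError("need one score per candidate and at least one candidate")

    # Rank candidates by retriever (bi-encoder) score, best first.
    ranked = sorted(zip(scores, candidates), reverse=True)
    top_score, top_candidate = ranked[0]

    # Assumed confidence signal: margin between the two best scores.
    margin = top_score - ranked[1][0] if len(ranked) > 1 else float("inf")

    if margin >= margin_threshold:
        # Easy case: accept the retriever's answer, no LLM call.
        return top_candidate, False

    # Hard case: let the LLM choose among the top few candidates only.
    shortlist = [candidate for _, candidate in ranked[:shortlist_size]]
    return llm_rerank(shortlist), True
```

Restricting the LLM to a retrieved shortlist is one way to limit hallucinated entities, since a well-behaved `llm_rerank` can only return a candidate that already exists in the knowledge base.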

New Bangla riddle benchmark shows LLMs often copy surface words but fail real riddle reasoning

0.30

0.70

0.40

0

If you build Bangla NLP products that must reason with cultural metaphors or resolve wordplay, off-the-shelf LLMs are not yet reliable. Superficial word overlap can mask incorrect reasoning. Use targeted benchmarks like BANGLARIDDLEEVAL to validate models before deployment.

Key finding

Dataset size and structure

Numbers: 1,244 riddles × 4 tasks = 4,976 artifacts

An 11B Transformer tuned for Polish that rivals much larger models across European benchmarks

0.70

0.55

0.70

0

Get strong Polish and European-language performance with an 11B model that runs on consumer GPUs and supports quantized deployment—cut infrastructure costs versus 70B+ models while keeping high accuracy for local applications.

Key finding

Instruction-tuned Bielik-11B-v3 ranks among top open models on Polish benchmarks.

Numbers: Open PL LLM Leaderboard (Instruct): 65.93 average

Camellia: a new benchmark that measures how multilingual LLMs favor Western vs Asian entities in nine Asian languages

0.40

0.30

0.20

0

Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.

Key finding

LLMs often prefer Western entities even when the context requires an Asian entity.

Numbers: CBS ≈ 30–40% on culturally-grounded contexts (expected ~5%)

Use causal effects on multilingual feedback to decide when LLMs should abstain

0.60

0.40

0

CausalAbstain reduces wrong answers across multiple languages by selectively using model feedback, improving trust in multilingual QA systems at the price of higher API cost.

Key finding

CAUSAL-MULTI outperforms prior methods on the evaluated benchmarks.

Numbers: Average improvement +3.5% vs best competing method (across 3 models × 2 datasets)

A new Hindi analogy test (HATS) shows multilingual LLMs reason better when prompted in English and still make language-specific mistakes

0.40

0.30

0.25

0

You cannot assume multilingual models reason equally well in non-English languages. For product features that rely on conceptual reasoning (search, question answering, exam prep), prompt language and translation choices materially change accuracy and safety.

Key finding

English-only prompts give the best accuracy across models and settings.

Numbers: Table 2: English-only top scores up to 79.75%