Overview
Production Readiness
0.6
Novelty Score
0.8
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
Multilingual LLMs can produce worse commonsense text in non-English languages. If your product serves multi-language users, relying on off-the-shelf models without language-specific testing risks poor UX and wrong behavior in lower-resource languages.
Summary TLDR
The authors introduce MULTICOM, a 4‑language benchmark (English, Spanish, Dutch, Valencian) for a constrained commonsense generation task: produce a natural sentence containing three given keywords. They evaluate five open-source LLM families (LLaMA, Qwen, Gemma, EuroLLM, Salamandra) with automatic metrics, two LLM-as-judge models (Prometheus, JudgeLM), and a small human study. Models perform best in English, and performance drops in less-resourced languages. Context injection sometimes helps in low-resource languages but can hurt in other cases. The dataset is publicly available on Hugging Face.
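The keyword constraint in this task can be checked mechanically. The following is a minimal sketch, assuming exact, case-insensitive token matching; the function name `covers_keywords` and the example sentences are hypothetical and not taken from the dataset or the paper's evaluation code.

```python
import re


def covers_keywords(sentence: str, keywords: list[str]) -> bool:
    """Return True if every keyword appears as a whole token in the sentence."""
    # \w+ is Unicode-aware in Python 3, so accented letters stay inside tokens.
    tokens = set(re.findall(r"\w+", sentence.casefold()))
    return all(keyword.casefold() in tokens for keyword in keywords)


# Hypothetical examples, not taken from MULTICOM.
english_ok = covers_keywords(
    "The dog chased the ball across the park.", ["dog", "ball", "park"]
)
spanish_missing = covers_keywords(
    "El perro corre por el parque.", ["perro", "pelota", "parque"]
)
print(english_ok, spanish_missing)
```

Exact matching misses inflected forms (for example "dogs" for the keyword "dog"), so a stricter check for morphologically rich languages would likely need lemmatization.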
Problem Statement
Do current LLMs generate equally commonsensical text across languages? The paper asks whether multilingual LLMs retain commonsense ability outside English, and whether giving a short context helps.
Main Contribution
MULTICOM: a multilingual benchmark extending the Spanish commonsense corpus COCOTEROS to English, Dutch and Valencian (training: 3,875 triples; test: 3,876 instances, 969 per language).
A controlled evaluation across five open-source LLM families and multiple sizes and instruction-tuned variants.
A three-way evaluation mix: automatic metrics, two LLM-as-judge models (Prometheus, JudgeLM), and human annotations on 20 test items.
Key Findings
Models are consistently stronger at generating commonsense sentences in English than in other evaluated languages.
Context injection helps more often in under-resourced languages, but its effect varies by model.
LLM-as-judge scores broadly agree with human judgments on how the languages rank (English first), but the judges score inconsistently.
Evaluation metrics and language-specific tools can bias results toward English.
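One way to quantify how closely judge and human rankings agree is a rank correlation over per-language mean scores. The sketch below computes Spearman's rho with the standard no-ties formula; the score values are placeholders for illustration only, not results from the paper, and the paper may measure agreement differently.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank keys from best (1) to worst by descending score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation for two score tables over the same keys."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Placeholder per-language means, invented for this example.
judge_means = {"English": 4.1, "Spanish": 3.6, "Dutch": 3.4, "Valencian": 2.9}
human_means = {"English": 4.3, "Spanish": 3.3, "Dutch": 3.5, "Valencian": 3.0}

rho = spearman_rho(judge_means, human_means)
print(round(rho, 3))
```

With only four languages, a single swapped pair moves rho noticeably, so the value is best read alongside the raw rankings.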
Results
BERTScore (LLaMA-3.2-3B Instruct)
USE cosine (LLaMA-3.2-3B Base)
Human annotator majority agreement
Who Should Care
What To Try In 7 Days
Run MULTICOM on your model for the target languages to spot commonsense gaps quickly.
Compare outputs with and without short context prompts; keep the variant that improves quality for your language.
Use an LLM-as-judge for fast triage but validate with a small human sample for critical languages.
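The with/without-context comparison above can be sketched as a per-language selection over whatever quality score you collect. This is a minimal sketch under stated assumptions: `pick_variant`, the variant labels, and all numbers are hypothetical, and the scores could come from any metric or judge you trust for the language.

```python
from statistics import mean


def pick_variant(scores: dict[str, dict[str, list[float]]]) -> dict[str, str]:
    """For each language, return the prompt variant with the highest mean score."""
    choice = {}
    for language, by_variant in scores.items():
        choice[language] = max(by_variant, key=lambda v: mean(by_variant[v]))
    return choice


# Invented scores for illustration; replace with your own measurements.
example_scores = {
    "Valencian": {"plain": [0.61, 0.58, 0.64], "context": [0.66, 0.63, 0.65]},
    "English": {"plain": [0.82, 0.85, 0.80], "context": [0.79, 0.81, 0.80]},
}

chosen = pick_variant(example_scores)
print(chosen)
```

Choosing per language rather than globally reflects the finding that context can help in some settings and hurt in others.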
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Automatic metrics and language encoders/parsers are English-leaning and can bias scores toward English (Limitations).
- LLM-as-judge prompts and rubrics were written in English; judge performance may vary by target language exposure (§5).
- Human evaluation is small (20 items), and the Valencian annotators were Catalan speakers, so treat Valencian results cautiously (§4.3, Limitations).
- Study includes only open-source models; proprietary models (GPT-4, Grok, Gemini) were not evaluated.
When Not To Use
- Do not use MULTICOM alone as a full measure of multilingual commonsense for production safety decisions.
- Avoid relying only on LLM-as-judge scores for low-resource languages without human validation.
Failure Modes
- Metric failure: embedding-based scores may not separate commonsense from surface similarity (the authors note BERTScore's limitations).
- Judge bias: LLM evaluators can give inconsistent scores across languages and models.
- Translation artifacts: machine translations and keyword alignment can change keyword presence or meaning across languages.
- Context injection backfire: adding context can sometimes reduce generation quality for some models.
Core Entities
Models
- LLaMA-3.2 (1B, 3B; base and instruct)
- Qwen3 (4B, 8B; base and instruct)
- Gemma-2 (3B; base and instruct; 9B excluded)
- EuroLLM (1.7B, 9B; base and instruct)
- Salamandra (3B, 7B; base and instruct)
Metrics
- BERTScore
- Universal Sentence Encoder (USE) + Cosine
- Dependency parsing + Levenshtein
- Dependency triplet vector similarity (SpaCy) + Cosine
- Human annotation
- LLM-as-judge scores (Prometheus, JudgeLM)
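Two primitives recur in the automatic metrics listed above: cosine similarity between vectors and Levenshtein edit distance between sequences. The sketch below implements both in plain Python. How the paper derives the vectors (USE embeddings, SpaCy triplet vectors) and which sequences it feeds to Levenshtein are not specified here, so the dependency-label example is an assumption.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (x != y))
            )
        previous = current
    return previous[-1]


# Assumed usage: compare dependency-label sequences of two sentences.
labels_reference = ["nsubj", "ROOT", "dobj"]
labels_generated = ["nsubj", "ROOT", "prep", "pobj"]
label_distance = levenshtein(labels_reference, labels_generated)
print(label_distance)
```

Because `levenshtein` accepts any sequence, the same function works on characters, tokens, or parse labels.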
Datasets
- MULTICOM (this paper)
- COCOTEROS (Spanish source)
- CommonGen
Benchmarks
- MULTICOM

