MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

September 8, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and multi-method evaluation are solid, but the human evaluation sample is small (20 items). Metric and translation biases weaken the evidence for Valencian and other low-resource languages.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 80%

Authors

Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt

Links

Abstract / PDF / Data

Why It Matters For Business

Multilingual LLMs can produce worse commonsense text in non-English languages. If your product serves multi-language users, relying on off-the-shelf models without language-specific testing risks poor UX and wrong behavior in lower-resource languages.

Summary TLDR

The authors introduce MULTICOM, a 4‑language benchmark (English, Spanish, Dutch, Valencian) for a constrained commonsense generation task: produce a natural sentence containing three given keywords. They evaluate five open-source LLM families (LLaMA, Qwen, Gemma, EuroLLM, Salamandra) with automatic metrics, two LLMs-as-judges (Prometheus, JudgeLM), and a small human study. Models perform best in English, and performance drops in less-resourced languages. Context injection sometimes helps in low-resource languages but can hurt in other cases. The dataset is public on Hugging Face.
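The task format can be sketched as a prompt plus a coverage check: given three keywords, ask for one sentence and verify that all three appear in it. The prompt wording, the function names, and the exact-token matching are illustrative assumptions, not the paper's setup. Inflected forms (plurals, conjugations) would need lemmatization, which this sketch omits.

```python
import re

def build_prompt(keywords, language, context=None):
    """Build a hypothetical generation prompt; the wording is an assumption."""
    prompt = (
        f"Write one natural, commonsensical sentence in {language} "
        f"that contains these words: {', '.join(keywords)}."
    )
    if context:
        # Optional short context, mirroring the context-injection condition.
        prompt += f" Context: {context}"
    return prompt

def missing_keywords(sentence, keywords):
    """Return the keywords that do not appear as whole tokens (case-insensitive)."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return [k for k in keywords if k.lower() not in tokens]

keywords = ["dog", "ball", "park"]
print(build_prompt(keywords, "English"))
print(missing_keywords("The dog chased a ball in the park.", keywords))  # []
print(missing_keywords("The dog slept all day.", keywords))  # ['ball', 'park']
```

Keyword coverage only checks the constraint; it says nothing about whether the sentence is commonsensical, which is what the metrics, judges, and human study are for.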

Problem Statement

Do current LLMs generate equally commonsensical text across languages? The paper asks whether multilingual LLMs retain commonsense ability outside English, and whether giving a short context helps.

Main Contribution

MULTICOM: a multilingual benchmark extending a Spanish commonsense corpus to English, Dutch and Valencian (training: 3,875 triples; test: 3,876 instances; 969 per language).

A controlled evaluation across five open-source LLM families and multiple sizes and instruction-tuned variants.

Key Findings

Models are consistently stronger at generating commonsense sentences in English than in other evaluated languages.

Numbers: LLaMA BERTScore (Ref) EN 0.903 vs ES 0.771 (Table 1).

Practical Use: If you need reliable commonsense output, prefer English prompts or fine-tune for the target language before deployment.

Evidence Ref: Table 1; Sections 4.1–4.3

Context injection helps under-resourced languages more often, but its effect varies by model.

Numbers: Context yields mixed metric changes; e.g., for some LLaMA runs Dutch/Valencian improved versus no-context (Table 1 and §4

Practical Use: Try adding targeted context when working in low-resource languages, but validate because context can also reduce quality for some models.

Evidence Ref: Table 1; Sections 4.1 and 4.3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BERTScore (LLaMA-3.2-3B Instruct) | EN 0.908 vs ES 0.777 (Ref comparisons) | - | EN − ES = +0.131 | MULTICOM test | Table 1 shows higher BERTScore for English references | Table 1 |
| USE cosine (LLaMA-3.2-3B Base) | ES Ref 0.632 vs EN Ref 0.555 | - | ES − EN = +0.077 (but metric may not reflect commonsense) | MULTICOM test | Table 1; authors caution metric sensitivity | Table 1; §4.1 |
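The deltas in these two rows can be recomputed from the reported scores. The short sketch below puts both on the same EN − ES convention, which makes it visible that the two metrics disagree on which language scores higher. The dictionary layout and function name are illustrative.

```python
# Reference-based scores as reported in the two Results rows.
scores = {
    "BERTScore (LLaMA-3.2-3B Instruct)": {"EN": 0.908, "ES": 0.777},
    "USE cosine (LLaMA-3.2-3B Base)": {"EN": 0.555, "ES": 0.632},
}

def en_minus_es(metric_scores):
    """English score minus Spanish score, rounded to three decimals."""
    return round(metric_scores["EN"] - metric_scores["ES"], 3)

for name, by_lang in scores.items():
    print(f"{name}: EN - ES = {en_minus_es(by_lang):+.3f}")
```

A positive value favors English and a negative value favors Spanish. The sign flip between BERTScore and USE cosine is consistent with the authors' caution that these metrics are sensitive and may not track commonsense quality.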

What To Try In 7 Days

Run MULTICOM on your model for the target languages to spot commonsense gaps quickly.

Compare outputs with and without short context prompts; keep the variant that improves quality for your language.

Use an LLM-as-judge for fast triage but validate with a small human sample for critical languages.
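One way to make the with/without-context comparison concrete is to score both variants per language and keep context only where it gives a clear gain. This is a minimal sketch: the function name, the `min_gain` threshold, and the example scores are all made up for illustration, and the scores could come from any judge or human rating scale.

```python
from statistics import mean

def pick_prompt_variant(scores_by_language, min_gain=0.0):
    """Per language, keep 'context' only if its mean score beats 'no_context' by more than min_gain."""
    choice = {}
    for language, variants in scores_by_language.items():
        gain = mean(variants["context"]) - mean(variants["no_context"])
        choice[language] = "context" if gain > min_gain else "no_context"
    return choice

# Made-up per-item quality scores, for illustration only.
example = {
    "Valencian": {"no_context": [2.0, 3.0, 2.5], "context": [3.0, 3.5, 3.0]},
    "English": {"no_context": [4.5, 4.0, 4.5], "context": [4.0, 4.0, 4.5]},
}
print(pick_prompt_variant(example))
```

With only a handful of items per language, small mean differences are noise; setting `min_gain` above zero and checking a human-rated sample guards against keeping a variant that only looks better.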

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Automatic metrics and language encoders/parsers are English-leaning and can bias scores toward English (Limitations).

LLM-as-judge prompts and rubrics were written in English; judge performance may vary by target language exposure (§5).

When Not To Use

Do not use MULTICOM alone as a full measure of multilingual commonsense for production safety decisions.

Avoid relying only on LLM-as-judge scores for low-resource languages without human validation.

Failure Modes

Metric failure: embedding-based scores may not separate commonsense from surface similarity (authors note BERTScore limits).

Judge bias: LLM evaluators can give inconsistent scores across languages and models.
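The metric failure mode can be illustrated with a deliberately crude stand-in. The sketch below uses bag-of-words cosine similarity, which is not BERTScore or USE and is chosen only because it needs no model download. It shows how a similarity score can reward lexical overlap with a reference while ignoring whether the sentence makes sense. The example sentences are invented.

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

reference = "The dog chased the ball in the park."
implausible = "The ball chased the dog in the park."  # same words, odd meaning
plausible = "A puppy ran after its toy outside."      # sensible, no shared words

print(bow_cosine(reference, implausible))
print(bow_cosine(reference, plausible))
```

The implausible sentence scores near the maximum because it reuses every reference word, while the sensible paraphrase scores zero. Contextual embeddings are far less extreme than this toy metric, but the example shows why similarity to a reference is not the same thing as commonsense.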

Core Entities

Models

LLaMA-3.2 (1B, 3B; base and instruct)

Qwen3 (4B, 8B; base and instruct)

Gemma-2 (3B; base and instruct; 9B excluded)

EuroLLM (1.7B, 9B; base and instruct)

Salamandra (3B, 7B; base and instruct)

Metrics

BERTScore

Universal Sentence Encoder (USE) + Cosine

Dependency parsing + Levenshtein

Dependency triplet vector similarity (SpaCy) + Cosine

Human annotation

LLM-as-judge scores (Prometheus, JudgeLM)
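This summary does not say how the paper combines dependency parsing with Levenshtein distance. As one plausible reading, the sketch below computes edit distance between two sequences of dependency labels and normalizes it into a similarity. The label sequences are hand-written and hypothetical; in practice they would come from a parser.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def normalized_similarity(a, b):
    """Map edit distance to [0, 1], where 1 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical dependency-label sequences for a generated and a reference sentence.
generated = ["det", "nsubj", "ROOT", "det", "dobj"]
reference = ["det", "nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj"]

print(levenshtein(generated, reference))            # 3
print(normalized_similarity(generated, reference))  # 0.625
```

A structural score like this depends on parser quality in each language, which ties into the limitation that parsers and encoders tend to be English-leaning.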

Datasets

MULTICOM (this paper)

COCOTEROS (Spanish source)

CommonGen

Benchmarks

MULTICOM