MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

Overview

Decision Snapshot: Needs Validation

The benchmark and multi-method evaluation are solid, but the human evaluation sample is small (20 items). Metric and translation biases weaken the evidence for Valencian and other low-resource languages.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 80%

Authors

Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt

Why It Matters For Business

Multilingual LLMs can produce worse commonsense text in non-English languages. If your product serves multi-language users, relying on off-the-shelf models without language-specific testing risks poor UX and wrong behavior in lower-resource languages.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors introduce MULTICOM, a 4‑language benchmark (English, Spanish, Dutch, Valencian) for a constrained commonsense generation task: produce a natural sentence containing three given keywords. They evaluate five open-source LLM families (LLaMA, Qwen, Gemma, EuroLLM, Salamandra) with automatic metrics, two LLMs-as-judges (Prometheus, JudgeLM), and a small human study. Models perform best in English, and performance drops in less-resourced languages. Context injection sometimes helps in low-resource languages but can hurt in other cases. The dataset is public on HuggingFace.
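The keyword constraint in the task can be checked mechanically before any quality scoring. The following is a minimal sketch, assuming a plain case-insensitive whole-word match; the function name and the example keywords are hypothetical and not taken from the dataset, and the paper may treat inflected forms differently (this check does no lemmatization).

```python
import re
from typing import Sequence


def covers_keywords(sentence: str, keywords: Sequence[str]) -> bool:
    """Return True if every keyword occurs as a whole word in the sentence.

    Assumption: surface-form matching only. An inflected form such as
    'dogs' will NOT match the keyword 'dog'.
    """
    # \w+ is Unicode-aware in Python 3, so accented letters stay inside words.
    words = {w.casefold() for w in re.findall(r"\w+", sentence)}
    return all(k.casefold() in words for k in keywords)


# Hypothetical example triple, for illustration only.
example_keywords = ["dog", "ball", "park"]
ok = covers_keywords("The dog chased the ball in the park.", example_keywords)
missing = covers_keywords("The dog slept all day.", example_keywords)
```

A check like this only confirms that the constraint was met; it says nothing about whether the sentence is commonsensical, which is what the metrics and judges are for.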

Problem Statement

Do current LLMs generate equally commonsensical text across languages? The paper asks whether multilingual LLMs retain commonsense ability outside English, and whether giving a short context helps.

Main Contribution

MULTICOM: a multilingual benchmark extending a Spanish commonsense corpus to English, Dutch and Valencian (training: 3,875 triples; test: 3,876 instances; 969 per language).

A controlled evaluation across five open-source LLM families and multiple sizes and instruction-tuned variants.

Key Findings

Models are consistently stronger at generating commonsense sentences in English than in other evaluated languages.

Numbers: LLaMA BERTScore (Ref) EN 0.903 vs ES 0.771 (Table 1).

Practical Use: If you need reliable commonsense output, prefer English prompts or fine-tune for the target language before deployment.

Evidence Ref: Table 1; Sections 4.1–4.3

Context injection helps under-resourced languages more often, but its effect varies by model.

Numbers: Context yields mixed metric changes; e.g., for some LLaMA runs Dutch/Valencian improved versus no-context (Table 1 and §4

Practical Use: Try adding targeted context when working in low-resource languages, but validate because context can also reduce quality for some models.

Evidence Ref: Table 1; Sections 4.1 and 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BERTScore (LLaMA-3.2-3B Instruct)	EN 0.908 vs ES 0.777 (Ref comparisons)	—	EN − ES = +0.131	MULTICOM test	Table 1 shows higher BERTScore for English references	Table 1
USE cosine (LLaMA-3.2-3B Base)	ES Ref 0.632 vs EN Ref 0.555	—	ES − EN = +0.077 (but metric may not reflect commonsense)	MULTICOM test	Table 1; authors caution metric sensitivity	Table 1; §4.1
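The deltas in the table can be reproduced directly from the reported values. This is a small arithmetic sketch using only the numbers shown above; the dictionary layout and the helper name are illustrative choices.

```python
# Values copied from the Results table above.
reported = {
    "BERTScore (LLaMA-3.2-3B Instruct)": {"EN": 0.908, "ES": 0.777},
    "USE cosine (LLaMA-3.2-3B Base)": {"EN": 0.555, "ES": 0.632},
}


def delta(scores: dict, lang_a: str, lang_b: str) -> float:
    """Score of lang_a minus score of lang_b, rounded to three decimals."""
    return round(scores[lang_a] - scores[lang_b], 3)


bert_delta = delta(reported["BERTScore (LLaMA-3.2-3B Instruct)"], "EN", "ES")
use_delta = delta(reported["USE cosine (LLaMA-3.2-3B Base)"], "ES", "EN")
print(bert_delta, use_delta)  # 0.131 0.077
```

The two metrics point in opposite directions (English ahead on BERTScore, Spanish ahead on USE cosine), which is consistent with the authors' caution about metric sensitivity.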

What To Try In 7 Days

Run MULTICOM on your model for the target languages to spot commonsense gaps quickly.

Compare outputs with and without short context prompts; keep the variant that improves quality for your language.

Use an LLM-as-judge for fast triage but validate with a small human sample for critical languages.
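The last two steps above can be sketched as follows: pick the prompt variant with the higher mean score per language, then draw a fixed random sample for human review. This is a sketch under stated assumptions; the function names, the score values, the language codes, and the sample size are all hypothetical, and the scores could come from any metric or judge you trust.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def pick_prompt_variant(scores: Dict[str, Dict[str, List[float]]]) -> Dict[str, str]:
    """For each language, return the variant name with the highest mean score."""
    choice = {}
    for lang, by_variant in scores.items():
        means = {variant: mean(vals) for variant, vals in by_variant.items()}
        choice[lang] = max(means, key=means.get)
    return choice


def human_sample(item_ids: Sequence[int], k: int = 20, seed: int = 0) -> List[int]:
    """Draw a reproducible random sample of item ids for manual validation."""
    rng = random.Random(seed)
    return rng.sample(list(item_ids), min(k, len(item_ids)))


# Hypothetical scores, for illustration only.
scores = {
    "va": {"context": [0.8, 0.7], "no_context": [0.6, 0.7]},
    "en": {"context": [0.85, 0.9], "no_context": [0.9, 0.9]},
}
chosen = pick_prompt_variant(scores)
to_review = human_sample(range(969), k=20)
```

Choosing per language rather than globally matters here, because the summary reports that context can help in some languages and hurt in others.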

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/gplsi/MULTICOM

Risks & Boundaries

Limitations

Automatic metrics and language encoders/parsers are English-leaning and can bias scores toward English (Limitations).

LLM-as-judge prompts and rubrics were written in English; judge performance may vary by target language exposure (§5).

When Not To Use

Do not use MULTICOM alone as a full measure of multilingual commonsense for production safety decisions.

Avoid relying only on LLM-as-judge scores for low-resource languages without human validation.

Failure Modes

Metric failure: embedding-based scores may not separate commonsense from surface similarity (authors note BERTScore limits).

Judge bias: LLM evaluators can give inconsistent scores across languages and models.

Core Entities

Models

LLaMA-3.2 (1B, 3B; base and instruct)

Qwen3 (4B, 8B; base and instruct)

Gemma-2 (3B; base and instruct; 9B excluded)

EuroLLM (1.7B, 9B; base and instruct)

Salamandra (3B, 7B; base and instruct)

Metrics

BERTScore

Universal Sentence Encoder (USE) + Cosine

Dependency parsing + Levenshtein

Dependency triplet vector similarity (spaCy) + Cosine

Human annotation

LLM-as-judge scores (Prometheus, JudgeLM)
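Two of the building blocks named above, cosine similarity and Levenshtein distance, are simple enough to sketch in plain Python. This is an illustration of the generic techniques only, not the authors' pipeline: how the paper builds embeddings or dependency sequences is not described in this summary, and the dependency labels in the example are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical dependency-label sequences for a generated and a reference sentence.
generated = ["nsubj", "ROOT", "dobj"]
reference = ["nsubj", "ROOT", "prep", "pobj"]
distance = levenshtein(generated, reference)
```

Both measures compare form (vector direction or label sequence), which is why the Failure Modes section warns that such scores may track surface similarity more than commonsense.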

Datasets

MULTICOM (this paper)

COCOTEROS (Spanish source)

CommonGen

Benchmarks

MULTICOM
