MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

September 8, 2025 (7 min read)

Overview

Production Readiness

0.6

Novelty Score

0.8

Cost Impact Score

0.3

Citation Count

0

Authors

Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt

Links

Abstract / PDF

Why It Matters For Business

Multilingual LLMs can produce worse commonsense text in non-English languages. If your product serves users in several languages, relying on off-the-shelf models without language-specific testing risks poor UX and incorrect output in lower-resource languages.

Summary TLDR

The authors introduce MULTICOM, a 4‑language benchmark (English, Spanish, Dutch, Valencian) for a constrained commonsense generation task: produce a natural sentence containing three given keywords. They evaluate five open-source LLM families (LLaMA, Qwen, Gemma, EuroLLM, Salamandra) with automatic metrics, two LLMs-as-judges (Prometheus, JudgeLM), and a small human study. Models perform best in English; performance drops in less-resourced languages. Context injection sometimes helps in low-resource languages but can hurt in other cases. The dataset is public on Hugging Face.
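The keyword constraint in the task can be checked mechanically. The following is a minimal sketch, assuming exact whole-word matching after case folding; the function names and the example triple are hypothetical and not taken from the dataset or the paper's code.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC and case folding so accented characters compare consistently."""
    return unicodedata.normalize("NFC", text).casefold()


def missing_keywords(sentence: str, keywords: list[str]) -> list[str]:
    """Return the keywords that do not appear as whole words in the sentence."""
    tokens = set(re.findall(r"\w+", normalize(sentence)))
    return [kw for kw in keywords if normalize(kw) not in tokens]


# Hypothetical keyword triple and model outputs (illustrative only).
keywords = ["dog", "park", "ball"]
good = "The dog chased a ball across the park."
bad = "The dog slept in the park."

print(missing_keywords(good, keywords))  # no keyword is missing
print(missing_keywords(bad, keywords))   # "ball" is missing
```

Exact matching will flag inflected forms (for example a plural or a conjugated verb) as missing, so a real check for Spanish, Dutch, or Valencian would likely need lemmatization first.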

Problem Statement

Do current LLMs generate equally commonsensical text across languages? The paper asks whether multilingual LLMs retain commonsense ability outside English, and whether giving a short context helps.

Main Contribution

MULTICOM: a multilingual benchmark extending a Spanish commonsense corpus to English, Dutch and Valencian (training: 3,875 triples; test: 3,876 instances; 969 per language).

A controlled evaluation across five open-source LLM families and multiple sizes and instruction-tuned variants.

A three-way evaluation mix: automatic metrics, two LLM-as-judge models (Prometheus, JudgeLM), and human annotations on 20 test items.

Key Findings

Models are consistently stronger at generating commonsense sentences in English than in other evaluated languages.

Numbers: LLaMA BERTScore (Ref) EN 0.903 vs ES 0.771 (Table 1).

Context injection helps under-resourced languages more often than higher-resourced ones, but its effect varies by model.

Numbers: Context yields mixed metric changes; e.g., for some LLaMA runs Dutch/Valencian improved versus no-context (Table 1 and §4

LLM-as-judge scores broadly match human judgments on rank order (English top), but judges show inconsistencies.

Numbers: Human majority agreement: EN 0.75, ES 0.80, NL 0.75, Valencian 0.95 (human eval on 20 items, §4.3).

Evaluation metrics and language-specific tools can bias results toward English.

Numbers: Authors note BERTScore and language encoders/parsers are English-leaning; Valencian results flagged as unreliable (§5; 3

Results

BERTScore (LLaMA-3.2-3B Instruct)

Value: EN 0.908 vs ES 0.777 (Ref comparisons)

USE cosine (LLaMA-3.2-3B Base)

Value: ES Ref 0.632 vs EN Ref 0.555

Human annotator majority agreement

Value: EN 0.75, ES 0.80, NL 0.75, Valencian 0.95

Who Should Care

What To Try In 7 Days

Run MULTICOM on your model for the target languages to spot commonsense gaps quickly.

Compare outputs with and without short context prompts; keep the variant that improves quality for your language.

Use an LLM-as-judge for fast triage but validate with a small human sample for critical languages.
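The with/without-context comparison can be reduced to a per-language choice. This is a rough sketch, assuming you already have one quality score per generated sentence (from a metric or a judge); the score values, the variant names, and `best_variant_per_language` are all hypothetical.

```python
from statistics import mean

# Hypothetical per-item quality scores keyed by (language, prompt variant).
# The numbers are illustrative only and do not come from the paper.
scores = {
    ("en", "no_context"): [0.90, 0.88, 0.91],
    ("en", "context"): [0.87, 0.89, 0.86],
    ("nl", "no_context"): [0.70, 0.72, 0.68],
    ("nl", "context"): [0.75, 0.74, 0.73],
}


def best_variant_per_language(scores):
    """Pick the prompt variant with the higher mean score for each language."""
    by_language = {}
    for (language, variant), values in scores.items():
        by_language.setdefault(language, {})[variant] = mean(values)
    return {
        language: max(variants, key=variants.get)
        for language, variants in by_language.items()
    }


print(best_variant_per_language(scores))
```

A mean over a handful of items is noisy, so a small gap between variants is probably not enough to justify switching prompts without more samples or a human check.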

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Automatic metrics and language encoders/parsers are English-leaning and can bias scores toward English (Limitations).
  • LLM-as-judge prompts and rubrics were written in English; judge performance may vary by target language exposure (§5).
  • Human evaluation is small (20 items) and Valencian annotators were Catalan speakers, so treat Valencian results cautiously (§4.3, Limitations).
  • Study includes only open-source models; proprietary models (GPT-4, Grok, Gemini) were not evaluated.

When Not To Use

  • Do not use MULTICOM alone as a full measure of multilingual commonsense for production safety decisions.
  • Avoid relying only on LLM-as-judge scores for low-resource languages without human validation.
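One way to validate judge scores against a small human sample is to check whether both rank the same outputs similarly. The sketch below computes a Spearman rank correlation in plain Python; this is a swapped-in, generic check rather than the paper's own agreement analysis, and the score lists are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for the same five outputs (illustrative only).
judge_scores = [4, 2, 5, 3, 1]
human_scores = [5, 1, 4, 3, 2]
print(round(spearman(judge_scores, human_scores), 3))
```

Running this per language makes it easier to spot a language where the judge and the annotators disagree, which is the case the warning above is about.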

Failure Modes

  • Metric failure: embedding-based scores may not separate commonsense from surface similarity (authors note BERTScore limits).
  • Judge bias: LLM evaluators can give inconsistent scores across languages and models.
  • Translation artifacts: machine translations and keyword alignment can change keyword presence or meaning across languages.
  • Context injection backfire: adding context can reduce generation quality for some models.

Core Entities

Models

  • LLaMA-3.2 (1B, 3B; base and instruct)
  • Qwen3 (4B, 8B; base and instruct)
  • Gemma-2 (3B; base and instruct; 9B excluded)
  • EuroLLM (1.7B, 9B; base and instruct)
  • Salamandra (3B, 7B; base and instruct)

Metrics

  • BERTScore
  • Universal Sentence Encoder (USE) + Cosine
  • Dependency parsing + Levenshtein
  • Dependency triplet vector similarity (SpaCy) + Cosine
  • Human annotation
  • LLM-as-judge scores (Prometheus, JudgeLM)
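Two of the building blocks named in this list, cosine similarity and Levenshtein distance, are simple enough to sketch directly. The code below is a generic illustration under stated assumptions: the toy vectors stand in for sentence embeddings, and the dependency-label sequences are hypothetical, since this summary does not say how the paper serializes parses before comparing them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Toy stand-ins for sentence embeddings (real encoders return long vectors).
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))

# Hypothetical dependency-label sequences for a generated and a reference sentence.
generated = ["det", "nsubj", "ROOT", "det", "obj"]
reference = ["det", "nsubj", "ROOT", "obj"]
print(levenshtein(generated, reference))
```

Both functions measure surface or structural closeness to a reference, which is why a high score does not by itself show that a sentence is commonsensical.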

Datasets

  • MULTICOM (this paper)
  • COCOTEROS (Spanish source)
  • CommonGen

Benchmarks

  • MULTICOM