ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

February 3, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments are comprehensive and well-validated, but generation noise and LLM-judge false positives mean extra human verification is needed before production use.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras

Links

Abstract / PDF / Data

Why It Matters For Business

Cultural mistakes cause reputational and legal risks in products serving local users; this dataset reveals where LLMs confuse prominent facts with situationally correct behavior and helps calibrate safer regional outputs.

Summary TLDR

This paper introduces ID-MoCQA, a bilingual (Indonesian/English) dataset of 15,590 two-hop multiple-choice questions built by expanding province-specific single-hop items from IndoCulture. Questions force models to (1) identify a province from indirect cultural clues and (2) pick the culturally appropriate option. Human annotators reach 70.0% multi-hop accuracy (95.1% first-hop). Frontier LLMs (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3) reach ~80–82% overall and often beat humans on multi-hop accuracy, but most models still struggle on the second hop: they pick prominent or well-documented cultural facts rather than situationally correct practices. The authors document generation noise (26% of a

Problem Statement

Existing cultural QA benchmarks mostly use single-hop questions that let models answer from a single fact. That setup hides whether models can chain cultural clues and apply context. The paper builds a large two-hop Indonesian dataset to test true cultural reasoning: identify the region from indirect clues, then select the context-appropriate cultural practice.

Main Contribution

A systematic LLM-guided pipeline that converts province-specific single-hop cultural QA into two-hop bilingual (ID/EN) questions across six clue types.

ID-MoCQA: a human-verified dataset of 15,590 multi-hop questions (7,795 per language) covering 11 provinces and 12 cultural topics.

Key Findings

Final dataset contains 15,590 bilingual multi-hop questions.

Numbers: 15,590 total; 7,795 per language

Practical Use: Use ID-MoCQA for large-scale evaluation of cultural multi-hop reasoning in Indonesian and English.

Evidence Ref: §4.6

Human baseline multi-hop accuracy is 70.0%; first-hop (province) accuracy is 95.1%.

Numbers: Human multi-hop 70.0%; first-hop 95.1%

Practical Use: Human annotators identify the province almost every time (95.1%) but answer the full two-hop question correctly only 70.0% of the time, so the difficulty lies in the situational second step rather than in location recognition.

Evidence Ref: §6.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Dataset size | 15,590 total (7,795 English, 7,795 Indonesian) | | | ID-MoCQA | Final dataset after multi-stage validation | §4.6 |
| Accuracy | 70.0% | | | ID-MoCQA (n=3 annotators) | Human baseline performance | §6.1 |

What To Try In 7 Days

Run your model on a ~200-sample slice of ID-MoCQA to spot cultural second-hop failures.

Measure first-hop vs both-hop accuracy to separate recognition vs situational reasoning gaps.

If you use LLMs to generate cultural content, add a human spot-check for comparison/intersection claims.
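The first two steps above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the record fields (`gold_province`, `pred_province`, `gold_answer`, `pred_answer`) and both helper names are assumptions, and it presumes your model is prompted to output a province as well as a final option.

```python
import random
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical record layout; adapt it to however you store model outputs.
    gold_province: str
    pred_province: str
    gold_answer: str
    pred_answer: str


def sample_slice(items, k=200, seed=0):
    """Draw a reproducible random slice of at most k items."""
    rng = random.Random(seed)
    items = list(items)
    return rng.sample(items, min(k, len(items)))


def hop_accuracy(preds):
    """Return first-hop, final-answer, and both-hop accuracy for a slice."""
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to score")
    first = sum(p.pred_province == p.gold_province for p in preds)
    final = sum(p.pred_answer == p.gold_answer for p in preds)
    both = sum(
        p.pred_province == p.gold_province and p.pred_answer == p.gold_answer
        for p in preds
    )
    return {
        "first_hop": first / n,
        "final_answer": final / n,
        "both_hops": both / n,
        # Share of items where the region was right but the final answer was wrong.
        "second_hop_gap": (first - both) / n,
    }
```

A large `second_hop_gap` alongside a high `first_hop` score would point to the failure pattern this summary describes: the model recognizes the region but picks the wrong practice.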

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Generation noise: ~26% of sampled questions had significant factual or structural errors.

COMPARISON and INTERSECTION types are error-prone and require external data verification.

When Not To Use

Do not use ID-MoCQA to claim global cultural competence beyond Indonesia.

Avoid using raw generated items for training without human verification, especially comparison/intersection items.
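A simple way to prioritize human verification is to queue the error-prone question types first. The sketch below assumes each item is a dict with a `clue_type` field holding labels such as `COMPARISON` or `INTERSECTION`; that field name and the function name are hypothetical, and items with no label are treated conservatively as high priority.

```python
HIGH_RISK_TYPES = {"COMPARISON", "INTERSECTION"}


def split_for_review(items, type_key="clue_type"):
    """Split items into (review_first, review_later) by clue type.

    Every item still needs human verification before training use;
    this only decides which ones a reviewer should see first.
    """
    review_first, review_later = [], []
    for item in items:
        label = item.get(type_key)
        # Unlabeled items are queued first because their risk is unknown.
        if label is None or str(label).upper() in HIGH_RISK_TYPES:
            review_first.append(item)
        else:
            review_later.append(item)
    return review_first, review_later
```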

Failure Modes

Models select the most documented or prominent cultural fact rather than the context-appropriate practice.

An LLM-as-a-judge accepts problematic items at a ~22% false-positive rate.
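To audit your own judge against human labels, a check along these lines may help. It assumes the false-positive rate means the share of human-flagged problematic items that the judge nonetheless accepts; the paper's exact definition may differ, and the function and argument names are illustrative.

```python
def judge_false_positive_rate(human_problematic, judge_accepted):
    """Share of human-flagged problematic items that the judge accepted.

    Both arguments are equal-length sequences of booleans, one entry per item.
    """
    if len(human_problematic) != len(judge_accepted):
        raise ValueError("label sequences must have the same length")
    # Keep only the judge decisions for items humans marked as problematic.
    accepted_flags = [
        accepted
        for problematic, accepted in zip(human_problematic, judge_accepted)
        if problematic
    ]
    if not accepted_flags:
        raise ValueError("no problematic items in the audited sample")
    return sum(accepted_flags) / len(accepted_flags)
```

A rate near the reported ~22% would suggest that roughly one in five flawed items slips past the judge, which is why the summary recommends extra human verification.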

Core Entities

Models

GPT-5, Claude-3.7-Sonnet, DeepSeek-V3, Gemma2-27B-Instruct, Llama3.3-70B-Instruct, Qwen2.5-72B-Instruct, Llama3.1-8B, Qwen2.5-7B, Merak-7B, SeaLLM-7B, GPT-4o

Metrics

Accuracy, precision, recall, Cohen's kappa, ICC

Datasets

ID-MoCQA, IndoCulture, HotpotQA, 2WikiMultiHopQA, MuSiQue

Benchmarks

ID-MoCQA, IndoCulture