Overview
The dataset and experiments are comprehensive and well-validated, but generation noise (~26% of sampled questions) and LLM-judge false positives (~22%) mean extra human verification is needed before production use.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
Cultural mistakes cause reputational and legal risks in products serving local users; this dataset reveals where LLMs confuse prominent facts with situationally correct behavior and helps calibrate safer regional outputs.
Who Should Care
Summary TLDR
This paper introduces ID-MoCQA, a bilingual (Indonesian/English) dataset of 15,590 two-hop multiple-choice questions built by expanding province-specific single-hop items from IndoCulture. Questions force models to (1) identify a province from indirect cultural clues and (2) pick the culturally appropriate option. Human annotators reach 70.0% multi-hop accuracy (95.1% first-hop). Frontier LLMs (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3) reach ~80–82% overall and often beat humans on multi-hop accuracy, but most models still struggle on the second hop: they pick prominent or well-documented cultural facts rather than situationally correct practices. The authors document generation noise (26% of a
Problem Statement
Existing cultural QA benchmarks mostly use single-hop questions that let models answer from a single fact. That setup hides whether models can chain cultural clues and apply context. The paper builds a large two-hop Indonesian dataset to test true cultural reasoning: identify the region from indirect clues, then select the context-appropriate cultural practice.
Main Contribution
A systematic LLM-guided pipeline that converts province-specific single-hop cultural QA into two-hop bilingual (ID/EN) questions across six clue types.
ID-MoCQA: a human-verified dataset of 15,590 multi-hop questions (7,795 per language) covering 11 provinces and 12 cultural topics.
Key Findings
Final dataset contains 15,590 bilingual multi-hop questions.
Human baseline multi-hop accuracy is 70.0%; first-hop (province) accuracy is 95.1%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 15,590 total (7,795 English, 7,795 Indonesian) | — | — | ID-MoCQA | Final dataset after multi-stage validation | §4.6 |
| Human multi-hop accuracy | 70.0% | — | — | ID-MoCQA (n=3 annotators) | Human baseline performance | §6.1 |
What To Try In 7 Days
Run your model on a ~200-sample slice of ID-MoCQA to spot cultural second-hop failures.
Measure first-hop vs both-hop accuracy to separate recognition vs situational reasoning gaps.
If you use LLMs to generate cultural content, add a human spot-check for comparison/intersection claims.
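A minimal sketch of the first two steps, assuming each evaluated item is a dict holding gold and predicted values for both hops. The field names (`gold_province`, `pred_province`, `gold_answer`, `pred_answer`) and the helper names are hypothetical and not part of ID-MoCQA's published schema; adapt them to however you store model outputs.

```python
import random


def sample_slice(items, k=200, seed=0):
    """Draw a reproducible random slice of at most k items."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def hop_accuracies(records):
    """Separate province recognition from situational reasoning.

    Each record is assumed (hypothetical schema) to carry gold and
    predicted values for the province hop and the final answer hop.
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    # First hop: did the model identify the province from the clues?
    first = sum(r["pred_province"] == r["gold_province"] for r in records)
    # Both hops: province and final answer are both correct.
    both = sum(
        r["pred_province"] == r["gold_province"]
        and r["pred_answer"] == r["gold_answer"]
        for r in records
    )
    return {
        "first_hop": first / n,
        "both_hop": both / n,
        # Second-hop accuracy among items whose province was recognized.
        "second_hop_given_first": both / first if first else float("nan"),
    }


# Toy records for illustration only; these are not dataset items.
records = [
    {"gold_province": "Aceh", "pred_province": "Aceh", "gold_answer": "B", "pred_answer": "B"},
    {"gold_province": "Bali", "pred_province": "Bali", "gold_answer": "A", "pred_answer": "C"},
    {"gold_province": "Papua", "pred_province": "Bali", "gold_answer": "D", "pred_answer": "D"},
    {"gold_province": "Aceh", "pred_province": "Aceh", "gold_answer": "C", "pred_answer": "C"},
]
scores = hop_accuracies(records)
print(scores)
```

A large gap between `first_hop` and `second_hop_given_first` points to the situational reasoning failure rather than a recognition failure.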
Risks & Boundaries
Limitations
Generation noise: ~26% of sampled questions had significant factual or structural errors.
COMPARISON and INTERSECTION types are error-prone and require external data verification.
When Not To Use
Do not use ID-MoCQA to claim global cultural competence beyond Indonesia.
Avoid using raw generated items for training without human verification, especially comparison/intersection items.
Failure Modes
Models select the most documented or prominent cultural fact rather than the context-appropriate practice.
LLM-as-a-judge accepts problematic items at a ~22% false-positive rate.
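To check the judge on your own sample, one option is to compare its verdicts against human labels. The sketch below assumes the false-positive rate is the share of human-flagged problematic items that the judge still accepts; the summary does not state the paper's exact denominator, so treat this definition, the function name, and the toy labels as assumptions.

```python
def judge_false_positive_rate(human_problematic, judge_accepted):
    """Share of human-flagged problematic items that the judge accepted.

    human_problematic[i] is True when annotators flagged item i.
    judge_accepted[i] is True when the LLM judge passed item i.
    """
    if len(human_problematic) != len(judge_accepted):
        raise ValueError("label lists must have the same length")
    # Keep only the judge verdicts for items humans flagged as problematic.
    verdicts_on_bad = [
        accepted
        for bad, accepted in zip(human_problematic, judge_accepted)
        if bad
    ]
    if not verdicts_on_bad:
        raise ValueError("no problematic items in the sample")
    return sum(verdicts_on_bad) / len(verdicts_on_bad)


# Toy labels for illustration only; these are not results from the paper.
human_problematic = [True, True, True, True, False, False, False, False]
judge_accepted = [True, False, False, False, True, True, True, False]
rate = judge_false_positive_rate(human_problematic, judge_accepted)
print(f"judge false-positive rate: {rate:.2f}")
```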

