Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Cultural mistakes cause reputational and legal risks in products serving local users; this dataset reveals where LLMs confuse prominent facts with situationally correct behavior and helps calibrate safer regional outputs.
Summary TLDR
This paper introduces ID-MoCQA, a bilingual (Indonesian/English) dataset of 15,590 two-hop multiple-choice questions built by expanding province-specific single-hop items from IndoCulture. Questions force models to (1) identify a province from indirect cultural clues and (2) pick the culturally appropriate option. Human annotators reach 70.0% multi-hop accuracy (95.1% first-hop). Frontier LLMs (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3) reach ~80–82% overall and often beat humans on multi-hop accuracy, but most models still struggle on the second hop: they pick prominent or well-documented cultural facts rather than situationally correct practices. The authors document generation noise (26% of a
Problem Statement
Existing cultural QA benchmarks mostly use single-hop questions that let models answer from a single fact. That setup hides whether models can chain cultural clues and apply context. The paper builds a large two-hop Indonesian dataset to test true cultural reasoning: identify the region from indirect clues, then select the context-appropriate cultural practice.
Main Contribution
A systematic LLM-guided pipeline that converts province-specific single-hop cultural QA into two-hop bilingual (ID/EN) questions across six clue types.
ID-MoCQA: a human-verified dataset of 15,590 multi-hop questions (7,795 per language) covering 11 provinces and 12 cultural topics.
Large-scale evaluation of 10 LLMs and a human baseline, plus an analysis of question quality, model failure modes, and Chain-of-Thought effects.
Key Findings
Final dataset contains 15,590 bilingual multi-hop questions.
Human baseline multi-hop accuracy is 70.0%; first-hop (province) accuracy is 95.1%.
Frontier LLMs achieve ~80–82% overall multi-hop accuracy; on the Indonesian questions, the top models outperform the human baseline by more than 10%.
Automatic generation quality is noisy: in a 3,000-sample manual check, 57.07% were OK and 26.20% had significant issues.
LLM-as-a-judge filtering achieved precision 0.78 and recall 0.82 versus human labels on a dual-annotated subset.
Models identify provinces far more reliably than they pick context-appropriate answers: frontier LLMs exceed 96% first-hop accuracy, but accuracy is 18–23% lower when both hops must be correct.
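Precision and recall for a judge filter can be checked on any dual-labeled subset with a few lines of code. The following is a minimal sketch, not the authors' evaluation script: it assumes each item has a boolean human label and a boolean judge label for the same positive class (this summary does not say which class the paper treats as positive), and the function name `judge_precision_recall` is hypothetical.

```python
from typing import Sequence, Tuple


def judge_precision_recall(
    human: Sequence[bool], judge: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of judge labels, treating human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)      # both say positive
    fp = sum(1 for h, j in zip(human, judge) if not h and j)  # judge-only positive
    fn = sum(1 for h, j in zip(human, judge) if h and not j)  # human-only positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only (not data from the paper).
human_labels = [True, True, True, False, False, True]
judge_labels = [True, True, False, True, False, True]
precision, recall = judge_precision_recall(human_labels, judge_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that 1 - precision is the share of the judge's positive decisions that humans disagree with, which is a different quantity from the false-positive rate (computed over human-negative items).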
Results
Dataset size
Accuracy
LLM-as-a-judge filtering performance
Manual QC sample (3,000)
Who Should Care
What To Try In 7 Days
Run your model on a ~200-sample slice of ID-MoCQA to spot cultural second-hop failures.
Measure first-hop accuracy against both-hops-correct accuracy to separate recognition gaps from situational-reasoning gaps.
If you use LLMs to generate cultural content, add a human spot-check for comparison/intersection claims.
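The first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation harness: the record fields (`gold_province`, `pred_province`, `gold_answer`, `pred_answer`), the helper names, and the fixed seed are all hypothetical, and the sketch presumes you have already parsed a province and an answer option out of each model response.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class HopResult:
    # Hypothetical record layout; ID-MoCQA's actual schema may differ.
    gold_province: str
    pred_province: str
    gold_answer: str
    pred_answer: str


def sample_slice(items: Sequence, k: int = 200, seed: int = 0) -> List:
    """Draw a reproducible random slice of at most k items."""
    rng = random.Random(seed)
    return rng.sample(list(items), min(k, len(items)))


def hop_accuracies(results: Sequence[HopResult]) -> Dict[str, float]:
    """First-hop accuracy vs. accuracy requiring both hops to be correct."""
    if not results:
        raise ValueError("no results to score")
    first = sum(r.pred_province == r.gold_province for r in results)
    both = sum(
        r.pred_province == r.gold_province and r.pred_answer == r.gold_answer
        for r in results
    )
    n = len(results)
    return {"first_hop": first / n, "both_hops": both / n, "gap": (first - both) / n}


# Toy results for illustration only; province names are placeholders.
demo = [
    HopResult("Aceh", "Aceh", "A", "A"),
    HopResult("Bali", "Bali", "B", "C"),   # right province, wrong practice
    HopResult("Papua", "Bali", "D", "D"),  # wrong province
    HopResult("Aceh", "Aceh", "C", "C"),
]
scores = hop_accuracies(demo)
print(scores)
```

A large `gap` with high `first_hop` points to the second-hop failure described in this summary: the model recognizes the region but does not pick the situationally correct practice.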
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Generation noise: ~26% of sampled questions had significant factual or structural errors.
- COMPARISON and INTERSECTION types are error-prone and require external data verification.
- Geographic scope: covers 11 Indonesian provinces, so results may not generalize to other Indonesian regions or to other countries.
- Heavy reliance on LLMs (Claude-3.7-Sonnet) during data creation introduces systematic biases.
When Not To Use
- Do not use ID-MoCQA to claim global cultural competence beyond Indonesia.
- Avoid using raw generated items for training without human verification, especially comparison/intersection items.
- Not suitable for tasks requiring precise, sourced comparative statistics without citation checks.
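One way to act on the second point is to gate generated items before they reach a training set. The sketch below is only an illustration: it assumes each item is a dict with a `clue_type` string and an optional `human_verified` flag, and both field names, the `OTHER` placeholder type, and the helper `split_for_training` are hypothetical. Only COMPARISON and INTERSECTION are hard-coded as high risk, because those are the clue types this summary names as error-prone.

```python
from typing import Dict, Iterable, List, Tuple

# Clue types this summary names as error-prone.
HIGH_RISK_TYPES = {"COMPARISON", "INTERSECTION"}


def split_for_training(items: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (usable, review_queue): only human-verified items are usable."""
    usable: List[Dict] = []
    review_queue: List[Dict] = []
    for item in items:
        if item.get("human_verified", False):
            usable.append(item)
        else:
            review_queue.append(item)
    # Stable sort: unverified high-risk clue types are reviewed first.
    review_queue.sort(
        key=lambda it: str(it.get("clue_type", "")).upper() not in HIGH_RISK_TYPES
    )
    return usable, review_queue


# Toy items for illustration only.
items = [
    {"id": 1, "clue_type": "OTHER", "human_verified": False},
    {"id": 2, "clue_type": "COMPARISON", "human_verified": False},
    {"id": 3, "clue_type": "INTERSECTION", "human_verified": True},
    {"id": 4, "clue_type": "intersection"},  # missing flag counts as unverified
]
usable, review_queue = split_for_training(items)
print([it["id"] for it in usable], [it["id"] for it in review_queue])
```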
Failure Modes
- Models select the most documented or prominent cultural fact rather than the context-appropriate practice.
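- LLM-as-a-judge filtering lets problematic items through: with precision 0.78, roughly 22% of its positive decisions disagree with human labels (this is 1 - precision, not a false-positive rate in the strict sense).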
- Comparison questions frequently contain incorrect ranking or unverifiable claims.
Core Entities
Models
- GPT-5
- Claude-3.7-Sonnet
- DeepSeek-V3
- Gemma2-27B-Instruct
- Llama3.3-70B-Instruct
- Qwen2.5-72B-Instruct
- Llama3.1-8B
- Qwen2.5-7B
- Merak-7B
- SeaLLM-7B
- GPT-4o
Metrics
- Accuracy
- precision
- recall
- Cohen's kappa
- ICC
Datasets
- ID-MoCQA
- IndoCulture
- HotpotQA
- 2WikiMultiHopQA
- MuSiQue
Benchmarks
- ID-MoCQA
- IndoCulture

