ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

February 3, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras

Why It Matters For Business

Cultural mistakes cause reputational and legal risks in products serving local users; this dataset reveals where LLMs confuse prominent facts with situationally correct behavior and helps calibrate safer regional outputs.

Summary TLDR

This paper introduces ID-MoCQA, a bilingual (Indonesian/English) dataset of 15,590 two-hop multiple-choice questions built by expanding province-specific single-hop items from IndoCulture. Questions force models to (1) identify a province from indirect cultural clues and (2) pick the culturally appropriate option. Human annotators reach 70.0% multi-hop accuracy (95.1% first-hop). Frontier LLMs (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3) reach ~80–82% overall and often beat humans on multi-hop accuracy, but most models still struggle on the second hop: they pick prominent or well-documented cultural facts rather than situationally correct practices. The authors document generation noise (26% of a 3,000-sample manual check had significant issues).

Problem Statement

Existing cultural QA benchmarks mostly use single-hop questions that let models answer from a single fact. That setup hides whether models can chain cultural clues and apply context. The paper builds a large two-hop Indonesian dataset to test true cultural reasoning: identify the region from indirect clues, then select the context-appropriate cultural practice.

Main Contribution

A systematic LLM-guided pipeline that converts province-specific single-hop cultural QA into two-hop bilingual (ID/EN) questions across six clue types.

ID-MoCQA: a human-verified dataset of 15,590 multi-hop questions (7,795 per language) covering 11 provinces and 12 cultural topics.

Large-scale evaluation of 10 LLMs and a human baseline, plus an analysis of question quality, model failure modes, and Chain-of-Thought effects.

Key Findings

Final dataset contains 15,590 bilingual multi-hop questions.

Numbers: 15,590 total; 7,795 per language

Human baseline multi-hop accuracy is 70.0%; first-hop (province) accuracy is 95.1%.

Numbers: Human multi-hop 70.0%; first-hop 95.1%

Frontier LLMs achieve ~80–82% multi-hop accuracy, and the top models outperform the human baseline in Indonesian by more than 10 percentage points.

Numbers: Claude-3.7-Sonnet 81.98% ID; GPT-5 81.37% ID

Automatic generation quality is noisy: in a 3,000-sample manual check, 57.07% were OK and 26.20% had significant issues.

Numbers: OK 57.07%; Significant 26.20%

LLM-as-a-judge filtering achieved precision 0.78 and recall 0.82 versus human labels on a dual-annotated subset.

Numbers: Precision 0.78; Recall 0.82

Models identify provinces far more reliably than they pick context-appropriate answers: frontier LLMs exceed 96% first-hop accuracy, but accuracy is 18–23% lower when both steps must be correct.

Numbers: First-hop >96%; 18–23% drop to both-correct
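The gap between first-hop and both-correct accuracy described above can be measured with a few lines of code. The following is a minimal sketch, not the authors' evaluation script: the `HopResult` record and its field names are hypothetical, and it assumes each hop is scored by exact match against a gold label.

```python
from dataclasses import dataclass


@dataclass
class HopResult:
    """One evaluated question. All field names are hypothetical."""
    province_gold: str  # gold first-hop answer (the province)
    province_pred: str  # model's first-hop prediction
    answer_gold: str    # gold second-hop option, e.g. "A"
    answer_pred: str    # option the model chose


def hop_accuracies(results: list[HopResult]) -> dict[str, float]:
    """Return first-hop, second-hop, and both-correct accuracy as fractions."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    first = sum(r.province_pred == r.province_gold for r in results)
    second = sum(r.answer_pred == r.answer_gold for r in results)
    both = sum(
        r.province_pred == r.province_gold and r.answer_pred == r.answer_gold
        for r in results
    )
    return {
        "first_hop": first / n,
        "second_hop": second / n,
        "both_correct": both / n,
        # Share of questions lost between recognising the province
        # and getting both steps right.
        "drop": (first - both) / n,
    }


if __name__ == "__main__":
    # Toy records, made up for illustration only.
    toy = [
        HopResult("Bali", "Bali", "A", "A"),
        HopResult("Aceh", "Aceh", "B", "C"),
    ]
    print(hop_accuracies(toy))
```

A large `drop` with a high `first_hop` points to the situational-reasoning failure the paper describes rather than to a region-recognition problem.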

Results

Dataset size

Value: 15,590 total (7,795 English, 7,795 Indonesian)

Accuracy (human baseline, multi-hop)

Value: 70.0%

Accuracy (human baseline, first-hop)

Value: 95.1%

Accuracy (frontier LLMs, multi-hop, Indonesian)

Value: Claude-3.7-Sonnet 81.98%; GPT-5 81.37%

Baseline: Human 70.0%

LLM-as-a-judge filtering performance

Value: Precision 0.78; Recall 0.82

Manual QC sample (3,000)

Value: OK 57.07%; Significant issues 26.20%

Who Should Care

What To Try In 7 Days

Run your model on a ~200-sample slice of ID-MoCQA to spot cultural second-hop failures.

Measure first-hop vs both-hop accuracy to separate recognition vs situational reasoning gaps.

If you use LLMs to generate cultural content, add a human spot-check for comparison/intersection claims.
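One way to draw the ~200-sample slice suggested above is to sample proportionally across clue types, so that error-prone types are not missed by chance. This is a sketch under assumptions: the items are taken to be dictionaries with a `clue_type` field, which is a hypothetical name and not a documented part of the dataset schema, and the toy type names are placeholders.

```python
import random
from collections import defaultdict


def stratified_slice(items, key, k, seed=0):
    """Sample about k items, spread proportionally across values of `key`.

    Quotas are rounded per group, so the result may differ from k by a few.
    """
    rng = random.Random(seed)  # fixed seed keeps the slice reproducible
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    total = len(items)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        group = groups[name]
        quota = max(1, round(k * len(group) / total))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


if __name__ == "__main__":
    # Synthetic stand-in for the dataset; the type names are placeholders.
    items = [{"id": i, "clue_type": f"TYPE_{i % 6}"} for i in range(600)]
    slice_200 = stratified_slice(items, key="clue_type", k=200)
    print(len(slice_200))
```

Keeping the seed fixed means the same slice can be rerun after a model or prompt change, which makes before/after comparisons meaningful.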

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Generation noise: ~26% of sampled questions had significant factual or structural errors.
  • COMPARISON and INTERSECTION types are error-prone and require external data verification.
  • Geographic scope: covers only 11 Indonesian provinces, so results may not generalize to other regions or to other countries.
  • Heavy reliance on LLMs (Claude-3.7-Sonnet) during data creation introduces systematic biases.

When Not To Use

  • Do not use ID-MoCQA to claim global cultural competence beyond Indonesia.
  • Avoid using raw generated items for training without human verification, especially comparison/intersection items.
  • Not suitable for tasks requiring precise, sourced comparative statistics without citation checks.

Failure Modes

  • Models select the most documented or prominent cultural fact rather than the context-appropriate practice.
  • LLM-as-a-judge filtering lets problematic items through: at precision 0.78, about 22% of the judge's positive decisions disagree with human labels.
  • Comparison questions frequently contain incorrect ranking or unverifiable claims.
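
The LLM-as-a-judge figures reported above (precision 0.78, recall 0.82) come from comparing judge decisions with human labels. The sketch below shows how such numbers can be computed for your own judge. It assumes `True` marks the positive class in both lists; which class the paper treats as positive is not stated in this summary, so that choice is an assumption.

```python
def precision_recall(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Precision and recall of judge decisions against human labels.

    True is the positive class in both lists (an assumption of this sketch).
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both positive
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge only
    fn = sum(h and (not j) for h, j in zip(human, judge))  # human only
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy labels, made up for illustration only.
    human_labels = [True, True, True, False, False]
    judge_labels = [True, True, False, True, False]
    p, r = precision_recall(human_labels, judge_labels)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Note that `1 - precision` is the share of the judge's positive decisions that are wrong, which differs from the false-positive rate `fp / (fp + tn)`.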

Core Entities

Models

  • GPT-5
  • Claude-3.7-Sonnet
  • DeepSeek-V3
  • Gemma2-27B-Instruct
  • Llama3.3-70B-Instruct
  • Qwen2.5-72B-Instruct
  • Llama3.1-8B
  • Qwen2.5-7B
  • Merak-7B
  • SeaLLM-7B
  • GPT-4o

Metrics

  • Accuracy
  • Precision
  • Recall
  • Cohen's kappa
  • ICC

Datasets

  • ID-MoCQA
  • IndoCulture
  • HotpotQA
  • 2WikiMultiHopQA
  • MuSiQue

Benchmarks

  • ID-MoCQA
  • IndoCulture