ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments are comprehensive and well-validated, but generation noise and LLM-judge false positives mean extra human verification is needed before production use.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras

Links

Abstract / PDF / Data

Why It Matters For Business

Cultural mistakes cause reputational and legal risks in products serving local users; this dataset reveals where LLMs confuse prominent facts with situationally correct behavior and helps calibrate safer regional outputs.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This paper introduces ID-MoCQA, a bilingual (Indonesian/English) dataset of 15,590 two-hop multiple-choice questions built by expanding province-specific single-hop items from IndoCulture. Questions force models to (1) identify a province from indirect cultural clues and (2) pick the culturally appropriate option. Human annotators reach 70.0% multi-hop accuracy (95.1% first-hop). Frontier LLMs (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3) reach ~80–82% overall and often beat humans on multi-hop accuracy, but most models still struggle on the second hop: they pick prominent or well-documented cultural facts rather than situationally correct practices. The authors document generation noise (26% of a

Problem Statement

Existing cultural QA benchmarks mostly use single-hop questions that let models answer from a single fact. That setup hides whether models can chain cultural clues and apply context. The paper builds a large two-hop Indonesian dataset to test true cultural reasoning: identify the region from indirect clues, then select the context-appropriate cultural practice.

Main Contribution

A systematic LLM-guided pipeline that converts province-specific single-hop cultural QA into two-hop bilingual (ID/EN) questions across six clue types.

ID-MoCQA: a human-verified dataset of 15,590 multi-hop questions (7,795 per language) covering 11 provinces and 12 cultural topics.

Key Findings

Final dataset contains 15,590 bilingual multi-hop questions.

Numbers: 15,590 total; 7,795 per language

Practical Use: Use ID-MoCQA for large-scale evaluation of cultural multi-hop reasoning in Indonesian and English.

Evidence Ref: §4.6

Human baseline multi-hop accuracy is 70.0%; first-hop (province) accuracy is 95.1%.

Numbers: Human multi-hop 70.0%; first-hop 95.1%

Practical Use: Humans identify provinces easily (95.1%) but often miss the second step (70.0% multi-hop), so multi-hop answers are meaningfully harder than location recognition.

Evidence Ref: §6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	15,590 total (7,795 English, 7,795 Indonesian)	—	—	ID-MoCQA	Final dataset after multi-stage validation	§4.6
Accuracy	70.0%	—	—	ID-MoCQA (n=3 annotators)	Human baseline performance	§6.1

What To Try In 7 Days

Run your model on a ~200-sample slice of ID-MoCQA to spot cultural second-hop failures.

Measure first-hop vs both-hop accuracy to separate recognition vs situational reasoning gaps.

If you use LLMs to generate cultural content, add a human spot-check for comparison/intersection claims.
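The first two steps above can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: the record fields (`gold_province`, `pred_province`, `gold_answer`, `pred_answer`) and both helper names are hypothetical, and it assumes your model's output has already been parsed into a province guess and a selected option.

```python
import random
from typing import Dict, List

# Hypothetical record layout: gold and predicted values for both hops.
Record = Dict[str, str]


def sample_slice(records: List[Record], k: int = 200, seed: int = 0) -> List[Record]:
    """Draw a reproducible random slice of at most k records."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def hop_accuracies(records: List[Record]) -> Dict[str, float]:
    """Return first-hop accuracy and both-hop accuracy over the records."""
    if not records:
        raise ValueError("no records to score")
    # First hop: the predicted province matches the gold province.
    first = sum(r["pred_province"] == r["gold_province"] for r in records)
    # Both hops: province and final answer are both correct.
    both = sum(
        r["pred_province"] == r["gold_province"]
        and r["pred_answer"] == r["gold_answer"]
        for r in records
    )
    n = len(records)
    return {"first_hop": first / n, "both_hop": both / n}
```

A large gap between `first_hop` and `both_hop` on the sampled slice would point to situational reasoning failures rather than recognition failures.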

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://arxiv.org/abs/2602.03709

https://arxiv.org/pdf/2602.03709v1

Risks & Boundaries

Limitations

Generation noise: ~26% of sampled questions had significant factual or structural errors.

COMPARISON and INTERSECTION types are error-prone and require external data verification.

When Not To Use

Do not use ID-MoCQA to claim global cultural competence beyond Indonesia.

Avoid using raw generated items for training without human verification, especially comparison/intersection items.
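One way to act on the second point is to route items by clue type before any training use. The sketch below is illustrative only: the `clue_type` field and the `route_for_review` helper are hypothetical names, and the only type labels taken from this page are COMPARISON and INTERSECTION.

```python
from typing import Dict, Iterable, List, Tuple

# Clue types flagged as error-prone under Limitations.
HIGH_RISK_TYPES = {"COMPARISON", "INTERSECTION"}

Item = Dict[str, str]


def route_for_review(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Split items into (needs_human_review, lower_risk) by clue type."""
    needs_review: List[Item] = []
    lower_risk: List[Item] = []
    for item in items:
        clue_type = item.get("clue_type", "").upper()
        # Items with a missing clue type are treated as high risk to stay conservative.
        if not clue_type or clue_type in HIGH_RISK_TYPES:
            needs_review.append(item)
        else:
            lower_risk.append(item)
    return needs_review, lower_risk
```

Items in the lower-risk bucket are not guaranteed clean; given the reported generation noise, a sampled spot-check of that bucket is still a reasonable precaution.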

Failure Modes

Models select the most documented or prominent cultural fact rather than the context-appropriate practice.

LLM-as-a-judge accepts problematic items at a ~22% false-positive rate.
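To check a judge on your own data, a false-positive rate can be computed against human labels. This sketch assumes one particular definition, which this page does not confirm: the share of human-rejected items that the judge accepted. The function name and the boolean label format are hypothetical.

```python
from typing import Sequence


def judge_false_positive_rate(
    human_ok: Sequence[bool], judge_ok: Sequence[bool]
) -> float:
    """Fraction of human-rejected items that the judge accepted.

    Assumed definition: false positives / (false positives + true negatives).
    """
    if len(human_ok) != len(judge_ok):
        raise ValueError("label sequences must have the same length")
    # Keep only the judge's verdicts on items that humans marked as problematic.
    judge_on_bad = [j for h, j in zip(human_ok, judge_ok) if not h]
    if not judge_on_bad:
        raise ValueError("no human-rejected items; rate is undefined")
    return sum(judge_on_bad) / len(judge_on_bad)
```

If the rate on your sample is high, judge acceptance alone is not enough to admit an item, and human verification should stay in the loop.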

Core Entities

Models

GPT-5, Claude-3.7-Sonnet, DeepSeek-V3, Gemma2-27B-Instruct, Llama3.3-70B-Instruct, Qwen2.5-72B-Instruct, Llama3.1-8B, Qwen2.5-7B, Merak-7B, SeaLLM-7B, GPT-4o

Metrics

Accuracy, precision, recall, Cohen's kappa, ICC

Datasets

ID-MoCQA, IndoCulture, HotpotQA, 2WikiMultiHopQA, MuSiQue

Benchmarks

ID-MoCQA, IndoCulture

