A dynamic town simulation that tests LLM agents on doing tasks while following local cultural norms

March 2, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is ready for in-house stress testing and researcher use; expect nontrivial compute costs and the need for human review of low-confidence verifier outputs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung

Links

Abstract / PDF / Data

Why It Matters For Business

If you deploy LLMs as agents that interact with people, you must test them in culturally diverse, multi-party settings; naive models can break local norms while achieving tasks, exposing reputational and compliance risks.

Who Should Care

Summary TLDR

LiveCultureBench is a runtime benchmark that places LLMs as agents in a simulated small city and scores both task success and adherence to socio-cultural norms. The environment uses realistic demographic sampling, location-conditioned norms (from CultureBank), supporting agents that apply social pressure, and an LLM-based verifier that is calibrated with conformal sampling. The results show four things: cross-cultural gaps are consistent; norm adherence drops as cultural diversity rises; models often prioritize task completion over norms; and an uncertainty-aware verifier can control verification risk but cannot fully replace human oversight.

Problem Statement

Current LLM evaluations focus on task success or static cultural prompts. They miss how cultural misalignment appears when models act as agents over time, and they rely on LLM judges without quantifying judge reliability. This paper builds a dynamic, multi-cultural simulation and measures both task completion and norm adherence, plus verifier trustworthiness.

Main Contribution

A modular, goal-driven social simulation (small town graph) that places LLMs as target and supporting agents and checks location-conditioned socio-cultural norms.

An evaluation protocol and time-aggregated metrics that jointly measure goal completion, norm violations, profile faithfulness, context awareness, and coherence.
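One plausible reading of "time-aggregated" is a per-step average over an episode. The sketch below assumes that reading. The `StepVerdict` fields, the `aggregate` function, and the definition of norm adherence as one minus the violation rate are illustrative assumptions, not formulas stated in the source.

```python
from dataclasses import dataclass


@dataclass
class StepVerdict:
    """Verifier judgments for one simulation step (hypothetical schema)."""
    goal_completed: bool
    norm_violated: bool
    faithful_to_profile: bool
    context_aware: bool
    coherent: bool


def aggregate(steps: list[StepVerdict]) -> dict[str, float]:
    """Average each per-step judgment over the whole episode."""
    if not steps:
        raise ValueError("episode has no steps")
    n = len(steps)

    def rate(flag: str) -> float:
        # Fraction of steps where the named boolean field is True.
        return sum(getattr(s, flag) for s in steps) / n

    return {
        "goal_completion": rate("goal_completed"),
        # Assumption: adherence is the share of steps without a violation.
        "norm_adherence": 1.0 - rate("norm_violated"),
        "profile_faithfulness": rate("faithful_to_profile"),
        "context_awareness": rate("context_aware"),
        "coherence": rate("coherent"),
    }
```

Reporting the five scores side by side, rather than folding them into one number, keeps the trade-off between task success and norm adherence visible.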

Key Findings

An LLM-based verifier achieves high task-level agreement with humans on held-out test sets.

Numbers: Goal Completion F1 92.41; Contextual Awareness F1 95.27; Coherence F1 96.08

Practical Use: Automated verification can scale evaluation for clear, short-horizon checks, but expect some residual errors; use verifier outputs for aggregate trends and flag low-confidence cases for human review.

Evidence Ref: Table 1 (Verifier Agent F1 scores)
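For readers who want to run the same kind of agreement check on their own annotations, here is a minimal sketch of binary F1 between human labels and verifier labels. The source does not say whether its F1 is binary, macro, or micro averaged, so the binary form is an assumption, and `f1_score` is a hypothetical helper name. It returns a fraction in [0, 1], whereas the figures above are on a 0 to 100 scale.

```python
def f1_score(human: list[bool], verifier: list[bool]) -> float:
    """Binary F1 of verifier labels, treating human labels as ground truth."""
    if len(human) != len(verifier):
        raise ValueError("label lists must have the same length")
    tp = sum(h and v for h, v in zip(human, verifier))        # both say positive
    fp = sum((not h) and v for h, v in zip(human, verifier))  # verifier over-flags
    fn = sum(h and (not v) for h, v in zip(human, verifier))  # verifier misses
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```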

Norm adherence often falls as cultural diversity in the scene increases.

Numbers: Gemini-2.5-Pro Norm Adherence drops 0.95 -> 0.85 (−0.10) when moving from single-culture to all cultures

Practical Use: Test agents in multicultural mixes before deployment; a model that is fine in single-culture tests may violate norms in diverse settings.

Evidence Ref: Table 13 (Norm Adherence by #Cultures) / Figure 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| Verifier F1 (Goal Completion) | 92.41 | | | 200 human-annotated test samples | Table 1 reports F1=92.41 for Goal Completion | Table 1 |
| Verifier F1 (Norm Violation) | 89.36 | | | 200 human-annotated test samples | Table 1 reports F1=89.36 for Norm Violation | Table 1 |

What To Try In 7 Days

Run the model in LiveCultureBench or similar multi-agent scenarios for a handful of personas to spot norm violations before release.

Add verifier-calibrated checks (conformal sampling) to flag low-confidence behavior and route those cases to human review.

Introduce simple cost terms or rule checks that penalize norm violations during decision-making or fine-tuning.
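The routing idea in the second suggestion can be sketched with plain split conformal prediction over verifier label probabilities. This is a simpler stand-in, not the Conformal Language Modeling procedure the benchmark uses, and every name here (`conformal_threshold`, `prediction_set`, `needs_human_review`, the label strings) is hypothetical. It assumes you have a small human-labeled calibration set and that the verifier exposes a probability for each label.

```python
import math


def conformal_threshold(true_label_probs: list[float], alpha: float) -> float:
    """Calibrate a nonconformity threshold from human-labeled examples.

    true_label_probs: verifier probability assigned to the human-chosen label.
    alpha: target miscoverage rate (e.g. 0.2 for roughly 80% coverage).
    """
    n = len(true_label_probs)
    scores = sorted(1.0 - p for p in true_label_probs)  # nonconformity scores
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: admit every label
    return scores[k - 1]


def prediction_set(label_probs: dict[str, float], threshold: float) -> set[str]:
    """Labels whose nonconformity score falls within the calibrated threshold."""
    return {label for label, p in label_probs.items() if 1.0 - p <= threshold}


def needs_human_review(label_probs: dict[str, float], threshold: float) -> bool:
    """Route to a human unless exactly one label survives."""
    return len(prediction_set(label_probs, threshold)) != 1
```

A usage example: with ten calibration probabilities and `alpha=0.2`, a near coin-flip verdict such as `{"violation": 0.55, "no_violation": 0.45}` is routed to review, while a confident `{"violation": 0.9, "no_violation": 0.1}` is accepted automatically.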

Agent Features

Memory
textual episodic memory logs per agent
Planning
day-long goal + subtasks generator
short-horizon conversational planning
Tool Use
location-specific action primitives
phone actions (call, message, order)
Frameworks
graph-structured town (locations as nodes)
Conformal Language Modeling (CLM) for verifier uncertainty
Is Agentic

Yes

Architectures
LLM-based agent policies (decoder-only LLMs)
Collaboration
supporting agents that exert social pressure and dialogue
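To make the features above concrete, here is a minimal sketch of how a graph-structured town with location-specific actions, phone actions, and per-agent textual memory might be wired together. The class names, method names, location names, and norm strings are all hypothetical placeholders; the source does not describe its data structures at this level.

```python
from dataclasses import dataclass, field

# Phone actions are listed in the source; treating them as available
# everywhere is an assumption of this sketch.
PHONE_ACTIONS = ["call", "message", "order"]


@dataclass
class Location:
    name: str
    actions: list[str]  # location-specific action primitives
    norms: list[str]    # location-conditioned norms (placeholder text)
    neighbors: set[str] = field(default_factory=set)


class Town:
    """Undirected graph with locations as nodes."""

    def __init__(self) -> None:
        self.locations: dict[str, Location] = {}

    def add_location(self, name: str, actions: list[str], norms: list[str]) -> None:
        self.locations[name] = Location(name, list(actions), list(norms))

    def connect(self, a: str, b: str) -> None:
        self.locations[a].neighbors.add(b)
        self.locations[b].neighbors.add(a)

    def available_actions(self, name: str) -> list[str]:
        loc = self.locations[name]
        moves = [f"move_to:{n}" for n in sorted(loc.neighbors)]
        return loc.actions + PHONE_ACTIONS + moves


@dataclass
class Agent:
    name: str
    location: str
    memory: list[str] = field(default_factory=list)  # textual episodic log

    def move(self, town: Town, destination: str) -> None:
        if destination not in town.locations[self.location].neighbors:
            raise ValueError(f"{destination} is not adjacent to {self.location}")
        self.memory.append(f"moved from {self.location} to {destination}")
        self.location = destination
```

Keeping norms on the location node means a verifier can look up which norms apply from the agent's current position alone.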

Optimization Features

Infra Optimization
recommendation to report compute budgets; large number of runs is computationally costly
Inference Optimization
use of vLLM for open-source model inference (described)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

CultureBank (cited as source of norms) - referenced but no URL provided in text

Risks & Boundaries

Limitations

Coverage is limited to the cultures and norms available in CultureBank and the chosen census marginals.

The LLM-based verifier, even with conformal sampling, cannot fully replace human judgment in ambiguous cultural cases.

When Not To Use

As a definitive measure of real people’s cultural correctness or as legal/HR evidence.

When you cannot afford human oversight for flagged low-confidence cases.

Failure Modes

Verifier bias or sampling sensitivity leading to false positives/negatives on norm violations.

Reifying simplified cultural rules and reinforcing stereotypes.

Core Entities

Models

Gemini-2.5-Pro, Gemini-2.5-Flash, Gemini-3-Pro, Qwen3-8B, Qwen3-14B, Qwen3-32B, Ministral-3-8B-Reasoning, Ministral-3-14B-Reasoning, Llama-3-8B, Llama-3-70B

Metrics

Goal Completion, Norm Adherence, Faithfulness to Profile, Contextual Awareness, Coherence

Datasets

CultureBank (cited as source of cultural norms)
Australian Census marginals (used for profile sampling)

Benchmarks

CDEval, NormAd