A dynamic town simulation that tests LLM agents on doing tasks while following local cultural norms

March 2, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung

Why It Matters For Business

If you deploy LLMs as agents that interact with people, you must test them in culturally diverse, multi-party settings; naive models can break local norms while achieving tasks, exposing reputational and compliance risks.

Summary TLDR

LiveCultureBench is a runtime benchmark that places LLMs as agents in a simulated small city and scores both task success and adherence to socio-cultural norms. The environment uses realistic demographic sampling, location-conditioned norms (from CultureBank), supporting agents that apply social pressure, and an LLM-based verifier calibrated with conformal sampling. Results show four things: cross-cultural gaps are consistent; norm adherence drops as cultural diversity rises; models often prioritize task completion over norms; and an uncertainty-aware verifier can control verification risk but cannot fully replace human oversight.

Problem Statement

Current LLM evaluations focus on task success or static cultural prompts. They miss how cultural misalignment appears when models act as agents over time, and they rely on LLM judges without quantifying judge reliability. This paper builds a dynamic, multi-cultural simulation and measures both task completion and norm adherence, plus verifier trustworthiness.

Main Contribution

A modular, goal-driven social simulation (small town graph) that places LLMs as target and supporting agents and checks location-conditioned socio-cultural norms.

An evaluation protocol and time-aggregated metrics that jointly measure goal completion, norm violations, profile faithfulness, context awareness, and coherence.
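The idea of time-aggregated metrics can be illustrated with a minimal sketch. It assumes that a verifier emits one boolean judgment per simulation step and that aggregation is a plain mean over steps; the `StepJudgment` fields and the `aggregate` function are hypothetical names, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepJudgment:
    """One verifier judgment for a single simulation step (hypothetical schema)."""
    goal_completed: bool
    norm_violated: bool


def aggregate(steps):
    """Average per-step judgments into episode-level scores in [0, 1]."""
    if not steps:
        raise ValueError("need at least one step to aggregate")
    goal_rate = mean(1.0 if s.goal_completed else 0.0 for s in steps)
    violation_rate = mean(1.0 if s.norm_violated else 0.0 for s in steps)
    return {
        "goal_completion": goal_rate,
        # Assumption: adherence is reported as one minus the violation rate.
        "norm_adherence": 1.0 - violation_rate,
    }
```

The same pattern would extend to profile faithfulness, context awareness, and coherence by adding one field per metric.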

A verifier-as-an-object-of-study: an LLM-based verifier calibrated with conformal sampling to estimate verification risk and identify when human oversight is needed.

Key Findings

An LLM-based verifier achieves high task-level agreement with humans on held-out test sets.

Numbers: Goal Completion F1 92.41; Contextual Awareness F1 95.27; Coherence F1 96.08
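For readers who want to reproduce this kind of agreement check on their own data, here is a minimal sketch of binary F1 between human and verifier labels. It assumes boolean labels with the human label as ground truth; whether the paper uses binary or macro-averaged F1 is not stated in this summary, and `f1_score` is a hypothetical helper, not a library call.

```python
def f1_score(human, verifier):
    """Binary F1 of verifier labels against human labels (both lists of bools)."""
    if len(human) != len(verifier):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for h, v in zip(human, verifier) if h and v)
    fp = sum(1 for h, v in zip(human, verifier) if not h and v)
    fn = sum(1 for h, v in zip(human, verifier) if h and not v)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Multiplying the result by 100 gives the percentage scale used in the reported numbers.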

Norm adherence often falls as cultural diversity in the scene increases.

Numbers: Gemini-2.5-Pro Norm Adherence drops 0.95 -> 0.85 (Δ −0.10) when moving from single-culture to all cultures

Models tend to prioritize goal completion over following local norms.

Numbers: Goal Completion changes little with more cultures while Norm Adherence declines (see Table 13 vs Table 14)

Norm adherence varies by location: norms are easier to follow in private, low-stakes places than in public, multi-actor places.

Numbers: Gemini-2.5-Pro: Apartment 0.78, Park 0.78 vs Office 0.62, Restaurant 0.59

Conformal sampling makes the verifier’s candidate sets obey risk targets; Gemini 3 Pro achieves low set loss (~0.14) in experiments.

Numbers: Verifier set-loss ≈0.14 for Gemini 3 Pro across risk settings (0.05–0.35)
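To make "candidate sets obey risk targets" concrete, here is a simplified sketch of threshold calibration. It is not the paper's Conformal Language Modeling procedure; it swaps in a basic conformal-risk-control style rule: pick the strictest confidence threshold whose corrected empirical set loss on human-labeled calibration data stays at or below the target `alpha`. The data layout (a dict of verdict scores plus the human verdict) and all function names are assumptions.

```python
def set_loss(scores, truth, threshold):
    """1.0 if the human verdict is missing from the candidate set, else 0.0."""
    candidate_set = {label for label, score in scores.items() if score >= threshold}
    return 0.0 if truth in candidate_set else 1.0


def calibrate_threshold(calibration, alpha, grid):
    """Return the largest threshold in `grid` that keeps risk <= alpha, or None.

    `calibration` is a list of (scores, human_verdict) pairs. Raising the
    threshold shrinks candidate sets, so the loss can only go up; the search
    therefore stops at the first threshold that breaks the risk target.
    """
    n = len(calibration)
    best = None
    for threshold in sorted(grid):
        risk = sum(set_loss(s, y, threshold) for s, y in calibration) / n
        # Finite-sample correction for a loss bounded by 1.
        if (n * risk + 1.0) / (n + 1.0) <= alpha:
            best = threshold
        else:
            break
    return best
```

A `None` result means no threshold on the grid meets the target with the available calibration data, which is itself a signal that human review is needed.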

Results

Verifier F1 (Goal Completion)

Value: 92.41

Verifier F1 (Norm Violation)

Value: 89.36

Norm Adherence (single-culture) - Gemini-2.5-Pro

Value: 0.95

Norm Adherence (all cultures) - Gemini-2.5-Pro

Value: 0.85

Baseline: 0.95 (single-culture)

Location difference (Norm Adherence) - Gemini-2.5-Pro

Value: Apartment 0.78; Park 0.78; Office 0.62; Restaurant 0.59

Verifier conformal set-loss (Gemini 3 Pro)

Value: ≈0.14

Who Should Care

What To Try In 7 Days

Run the model in LiveCultureBench or similar multi-agent scenarios for a handful of personas to spot norm violations before release.

Add verifier-calibrated checks (conformal sampling) to flag low-confidence behavior and route those cases to human review.

Introduce simple cost terms or rule checks that penalize norm violations during decision-making or fine-tuning.
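The last two suggestions can be sketched in a few lines. This is an illustrative assumption about how one might wire them up, not the paper's method: `penalized_score` subtracts a weighted violation probability from the task reward, and `needs_human_review` flags any case where the verifier's candidate set is not a single verdict. The action names, reward values, and the `penalty` weight are all hypothetical.

```python
def penalized_score(task_reward, violation_prob, penalty=2.0):
    """Task reward minus a cost proportional to the estimated violation risk."""
    return task_reward - penalty * violation_prob


def choose_action(candidates, penalty=2.0):
    """Pick the candidate action with the best penalized score."""
    return max(
        candidates,
        key=lambda c: penalized_score(c["task_reward"], c["violation_prob"], penalty),
    )


def needs_human_review(candidate_verdicts):
    """Route to a human unless the verifier's candidate set is one verdict."""
    return len(set(candidate_verdicts)) != 1


# Hypothetical example: the faster action carries a higher violation risk.
actions = [
    {"name": "cut_queue", "task_reward": 1.0, "violation_prob": 0.8},
    {"name": "wait_in_line", "task_reward": 0.8, "violation_prob": 0.05},
]
best = choose_action(actions)
```

With `penalty=0.0` the same call picks the higher-reward action, which mirrors the finding that models tend to favor task completion when nothing penalizes norm violations.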

Agent Features

Memory

  • textual episodic memory logs per agent

Planning

  • day-long goal + subtasks generator
  • short-horizon conversational planning

Tool Use

  • location-specific action primitives
  • phone actions (call, message, order)

Frameworks

  • graph-structured town (locations as nodes)
  • Conformal Language Modeling (CLM) for verifier uncertainty
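A graph-structured town can be represented very simply. The sketch below is an assumption about the general shape, not the benchmark's actual data model: the four location names come from the reported results, while the edges and the `shortest_route` helper are hypothetical.

```python
from collections import deque

# Locations as nodes; edges are hypothetical walking connections.
TOWN = {
    "Apartment": ["Park"],
    "Park": ["Apartment", "Office", "Restaurant"],
    "Office": ["Park"],
    "Restaurant": ["Park"],
}


def shortest_route(graph, start, goal):
    """Breadth-first search for the shortest path between two locations."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is unreachable from start
```

Location-conditioned norms could then be attached as a second mapping keyed by the same node names, so a verifier can look up which norms apply wherever the agent currently stands.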

Is Agentic

true

Architectures

  • LLM-based agent policies (decoder-only LLMs)

Collaboration

  • supporting agents that exert social pressure and dialogue

Optimization Features

Infra Optimization

  • recommends reporting compute budgets, since the large number of runs is computationally costly

Inference Optimization

  • use of vLLM for open-source model inference (described)

Reproducibility

Data Urls

  • CultureBank (cited as source of norms) - referenced but no URL provided in text

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Coverage is limited to the cultures and norms available in CultureBank and the chosen census marginals.
  • The LLM-based verifier, even with conformal sampling, cannot fully replace human judgment in ambiguous cultural cases.
  • Simulation simplifies culture as location-conditioned rules; this can reify stereotypes or miss intra-cultural variation.

When Not To Use

  • As a definitive measure of real people’s cultural correctness or as legal/HR evidence.
  • When you cannot afford human oversight for flagged low-confidence cases.
  • To optimize agents for social manipulation or exploitation of cultural norms.

Failure Modes

  • Verifier bias or sampling sensitivity leading to false positives/negatives on norm violations.
  • Reifying simplified cultural rules and reinforcing stereotypes.
  • Models prioritizing efficiency and task completion over safe, norm-respecting behavior.

Core Entities

Models

  • Gemini-2.5-Pro
  • Gemini-2.5-Flash
  • Gemini-3-Pro
  • Qwen3-8B
  • Qwen3-14B
  • Qwen3-32B
  • Ministral-3-8B-Reasoning
  • Ministral-3-14B-Reasoning
  • Llama-3-8B
  • Llama-3-70B

Metrics

  • Goal Completion
  • Norm Adherence
  • Faithfulness to Profile
  • Contextual Awareness
  • Coherence

Datasets

  • CultureBank (cited as source of cultural norms)
  • Australian Census marginals (used for profile sampling)
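Profile sampling from census marginals might look like the sketch below. The attribute names and proportions are placeholders, not real census figures, and sampling each attribute independently is a simplifying assumption; the paper's sampler may model joint distributions.

```python
import random

# Placeholder marginals; values are illustrative, not actual census data.
MARGINALS = {
    "age_group": {"18-34": 0.3, "35-64": 0.5, "65+": 0.2},
    "culture": {"culture_a": 0.6, "culture_b": 0.4},
}


def sample_profile(marginals, rng):
    """Draw one value per attribute, weighted by its marginal distribution."""
    return {
        attribute: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for attribute, dist in marginals.items()
    }


rng = random.Random(0)  # fixed seed so runs are repeatable
profiles = [sample_profile(MARGINALS, rng) for _ in range(5)]
```

Passing an explicit `random.Random` instance keeps agent populations reproducible across benchmark runs.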

Benchmarks

  • CDEval
  • NormAd