Overview
The benchmark is ready for in-house stress testing and researcher use; expect nontrivial compute costs and the need for human review of low-confidence verifier outputs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/6
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you deploy LLMs as agents that interact with people, you must test them in culturally diverse, multi-party settings; naive models can break local norms while achieving tasks, exposing reputational and compliance risks.
Summary TLDR
LiveCultureBench is a runtime benchmark that places LLMs as agents in a simulated small city and scores both task success and adherence to socio-cultural norms. The environment combines realistic demographic sampling, location-conditioned norms drawn from CultureBank, supporting agents that apply social pressure, and an LLM-based verifier calibrated with conformal sampling. The results show four things: consistent cross-cultural gaps; norm adherence that drops as cultural diversity rises; models that often prioritize task completion over norms; and an uncertainty-aware verifier that can control verification risk but cannot fully replace human oversight.
Problem Statement
Current LLM evaluations focus on task success or static cultural prompts. They miss how cultural misalignment appears when models act as agents over time, and they rely on LLM judges without quantifying judge reliability. This paper builds a dynamic, multi-cultural simulation and measures both task completion and norm adherence, plus verifier trustworthiness.
Main Contribution
A modular, goal-driven social simulation (small town graph) that places LLMs as target and supporting agents and checks location-conditioned socio-cultural norms.
An evaluation protocol and time-aggregated metrics that jointly measure goal completion, norm violations, profile faithfulness, context awareness, and coherence.
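To make the idea of time-aggregated metrics concrete, here is a minimal sketch of how per-step verifier judgments might be rolled up into episode-level scores. The `StepJudgment` fields and the aggregation rules (final-step goal progress, mean violation rate) are assumptions for illustration only; the summary does not specify the benchmark's actual formulas.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepJudgment:
    """Hypothetical per-step verifier output for one target agent."""
    goal_progress: float   # assumed to lie in [0, 1]
    norm_violation: bool   # True if the step broke a location-conditioned norm


def aggregate_episode(steps):
    """Aggregate step-level judgments into episode-level metrics.

    Assumption: goal completion is read from the last step, and norm
    violations are averaged over all steps.
    """
    if not steps:
        raise ValueError("episode has no steps")
    return {
        "goal_completion": steps[-1].goal_progress,
        "norm_violation_rate": mean(1.0 if s.norm_violation else 0.0 for s in steps),
    }


episode = [
    StepJudgment(0.2, False),
    StepJudgment(0.5, True),
    StepJudgment(1.0, False),
    StepJudgment(1.0, False),
]
metrics = aggregate_episode(episode)
```

Reporting the two numbers separately, rather than folding them into one score, keeps visible the case the summary highlights: an agent that finishes its task while violating norms along the way.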
Key Findings
An LLM-based verifier agrees closely with human annotators on 200 human-annotated test samples, with F1 of 92.41 for goal completion and 89.36 for norm violation (Table 1).
Norm adherence often falls as cultural diversity in the scene increases.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Verifier F1 (Goal Completion) | 92.41 | — | — | 200 human-annotated test samples | Table 1 reports F1=92.41 for Goal Completion | Table 1 |
| Verifier F1 (Norm Violation) | 89.36 | — | — | 200 human-annotated test samples | Table 1 reports F1=89.36 for Norm Violation | Table 1 |
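For readers who want to reproduce this kind of agreement score on their own annotations, the sketch below computes binary F1 between human labels and verifier labels. It assumes a plain binary F1 on the positive class; the summary does not say whether the paper uses binary, macro, or micro averaging, and the labels shown are made up for illustration.

```python
def f1_score(y_true, y_pred):
    """Binary F1 of verifier labels (y_pred) against human labels (y_true)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = norm violation, 0 = no violation).
human = [1, 1, 1, 0, 0]
verifier = [1, 1, 0, 1, 0]
score = f1_score(human, verifier)
```

The function returns a value between 0 and 1; multiply by 100 to compare with the percentage-style figures in the table.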
What To Try In 7 Days
Run the model in LiveCultureBench or similar multi-agent scenarios for a handful of personas to spot norm violations before release.
Add verifier-calibrated checks (conformal sampling) to flag low-confidence behavior and route those cases to human review.
Introduce simple cost terms or rule checks that penalize norm violations during decision-making or fine-tuning.
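The second step above, flagging low-confidence verifier outputs for human review, can be sketched with standard split conformal prediction. This is a stand-in technique and may differ from the paper's conformal sampling procedure. The function names, the calibration data, and the assumption that the verifier exposes a probability of violation are all hypothetical.

```python
import math


def conformal_threshold(cal_probs_true_label, alpha):
    """Split-conformal quantile of nonconformity scores.

    cal_probs_true_label: verifier probability assigned to the human label
    on each calibration example. The score is 1 - that probability.
    """
    scores = sorted(1.0 - p for p in cal_probs_true_label)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: keep every label
    return scores[k - 1]


def prediction_set(p_violation, qhat):
    """Labels whose nonconformity score is within the threshold."""
    labels = []
    if 1.0 - p_violation <= qhat:
        labels.append("violation")
    if p_violation <= qhat:  # score for "no_violation" is 1 - (1 - p)
        labels.append("no_violation")
    return labels


def route(p_violation, qhat):
    """Accept the verifier only when exactly one label survives."""
    labels = prediction_set(p_violation, qhat)
    return labels[0] if len(labels) == 1 else "human_review"


# Hypothetical calibration set: probability given to the human-chosen label.
calibration = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.9]
qhat = conformal_threshold(calibration, alpha=0.25)
decisions = [route(p, qhat) for p in (0.95, 0.5, 0.02)]
```

An empty set (no label is confident enough) and a two-label set (both are plausible) are treated the same way here: both go to a human reviewer, which matches the stated limitation that the verifier cannot fully replace human oversight.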
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Coverage is limited to the cultures and norms available in CultureBank and the chosen census marginals.
The LLM-based verifier, even with conformal sampling, cannot fully replace human judgment in ambiguous cultural cases.
When Not To Use
As a definitive measure of real people’s cultural correctness or as legal/HR evidence.
When you cannot afford human oversight for flagged low-confidence cases.
Failure Modes
Verifier bias or sampling sensitivity leading to false positives/negatives on norm violations.
Reifying simplified cultural rules and reinforcing stereotypes.

