Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
If you deploy LLMs as agents that interact with people, you must test them in culturally diverse, multi-party settings; naive models can break local norms while achieving tasks, exposing reputational and compliance risks.
Summary TLDR
LiveCultureBench is a runtime benchmark that places LLMs as agents in a simulated small city and scores both task success and adherence to socio-cultural norms. The environment uses realistic demographic sampling, location-conditioned norms (from CultureBank), supporting agents that apply social pressure, and an LLM-based verifier that is calibrated with conformal sampling. Results show consistent cross-cultural gaps: norm adherence drops as cultural diversity rises, and models often prioritize task completion over norms. An uncertainty-aware verifier can control verification risk but cannot fully replace human oversight.
Problem Statement
Current LLM evaluations focus on task success or static cultural prompts. They miss how cultural misalignment appears when models act as agents over time, and they rely on LLM judges without quantifying judge reliability. This paper builds a dynamic, multi-cultural simulation and measures both task completion and norm adherence, plus verifier trustworthiness.
Main Contribution
A modular, goal-driven social simulation (a small town modeled as a graph) that places LLMs as target and supporting agents and checks location-conditioned socio-cultural norms.
An evaluation protocol and time-aggregated metrics that jointly measure goal completion, norm violations, profile faithfulness, context awareness, and coherence.
A verifier-as-an-object-of-study: an LLM-based verifier calibrated with conformal sampling to estimate verification risk and identify when human oversight is needed.
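The time-aggregated metrics in the second contribution can be pictured with a minimal sketch. The `StepRecord` schema, the field names, and the aggregation rules (final goal progress, per-step rates) are assumptions made for illustration; the summary does not state the paper's exact formulas.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """Verifier output for one simulation step (hypothetical schema)."""
    goal_progress: float       # 0.0 to 1.0, share of the day's goal done so far
    norm_violation: bool       # True if the verifier flagged a violation
    faithful_to_profile: bool
    context_aware: bool
    coherent: bool


def rate(flags):
    """Fraction of True values in a non-empty list of booleans."""
    return sum(1 for f in flags if f) / len(flags)


def aggregate_episode(steps):
    """Collapse per-step judgments into one score per metric for an episode."""
    if not steps:
        raise ValueError("an episode needs at least one step")
    return {
        # Assumption: goal completion is read from the last step of the day.
        "goal_completion": steps[-1].goal_progress,
        # Assumption: adherence is the share of steps without a violation.
        "norm_adherence": 1.0 - rate([s.norm_violation for s in steps]),
        "faithfulness_to_profile": rate([s.faithful_to_profile for s in steps]),
        "contextual_awareness": rate([s.context_aware for s in steps]),
        "coherence": rate([s.coherent for s in steps]),
    }
```

Reporting goal completion and norm adherence side by side, rather than as one blended score, keeps the trade-off between the two visible.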
Key Findings
An LLM-based verifier achieves high task-level agreement with humans on held-out test sets.
Norm adherence often falls as cultural diversity in the scene increases.
Models tend to prioritize goal completion over following local norms.
Norm adherence varies by location: norms are easier to follow in private, low-stakes places than in public, multi-actor places.
Conformal sampling keeps the verifier’s candidate sets within the target risk level; as the verifier, Gemini-3-Pro reaches a low set loss (~0.14) in the experiments.
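To make "set loss" and "risk target" concrete, here is a simplified sketch in the style of conformal risk control. It is not the Conformal Language Modeling procedure the paper uses: it assumes the verifier returns scored candidate judgments, and it picks the strictest score threshold whose corrected loss on human-labelled calibration data stays within the target. All function names and the data layout are hypothetical.

```python
def candidate_set(candidates, threshold):
    """Labels whose verifier score is at or above the threshold."""
    return {label for label, score in candidates if score >= threshold}


def set_loss(candidates, human_label, threshold):
    """1 if the candidate set misses the human label, else 0."""
    return 0 if human_label in candidate_set(candidates, threshold) else 1


def calibrate_threshold(calibration, target_risk, grid):
    """Return the strictest threshold whose corrected set loss meets the target.

    calibration: list of (candidates, human_label) pairs, where candidates
    is a list of (label, score) tuples. Returns None if no threshold in the
    grid meets the target.
    """
    n = len(calibration)
    for threshold in sorted(grid, reverse=True):  # strictest (smallest sets) first
        total_loss = sum(set_loss(c, y, threshold) for c, y in calibration)
        # Finite-sample correction for a loss bounded by 1.
        corrected = (total_loss + 1) / (n + 1)
        if corrected <= target_risk:
            return threshold
    return None
```

One possible use, again an assumption rather than the paper's protocol: after calibration, steps whose candidate set is empty or holds more than one label are treated as low-confidence and sent to a human reviewer.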
Results
Verifier F1 (Goal Completion)
Verifier F1 (Norm Violation)
Norm Adherence (single-culture) - Gemini-2.5-Pro
Norm Adherence (all cultures) - Gemini-2.5-Pro
Location difference (Norm Adherence) - Gemini-2.5-Pro
Verifier conformal set-loss (Gemini-3-Pro)
Who Should Care
What To Try In 7 Days
Run the model in LiveCultureBench or similar multi-agent scenarios for a handful of personas to spot norm violations before release.
Add verifier-calibrated checks (conformal sampling) to flag low-confidence behavior and route those cases to human review.
Introduce simple cost terms or rule checks that penalize norm violations during decision-making or fine-tuning.
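The cost-term idea in the last suggestion can be sketched as a penalized action choice. The `CandidateAction` type, the linear penalty, and the example action names are illustrative assumptions, not something the paper specifies.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateAction:
    """An action the agent is considering (hypothetical schema)."""
    name: str
    task_value: float                  # estimated progress toward the goal
    violated_norms: list = field(default_factory=list)


def penalized_score(action, penalty_weight):
    """Task value minus a fixed cost per violated norm."""
    return action.task_value - penalty_weight * len(action.violated_norms)


def choose_action(candidates, penalty_weight=1.0):
    """Pick the candidate with the best penalized score."""
    return max(candidates, key=lambda a: penalized_score(a, penalty_weight))
```

With `penalty_weight=0.0` the choice reduces to pure task optimization, which is the behavior the findings warn about; raising the weight shifts the choice toward norm-respecting actions.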
Agent Features
Memory
- textual episodic memory logs per agent
Planning
- day-long goal + subtasks generator
- short-horizon conversational planning
Tool Use
- location-specific action primitives
- phone actions (call, message, order)
Frameworks
- graph-structured town (locations as nodes)
- Conformal Language Modeling (CLM) for verifier uncertainty
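A graph-structured town with location-conditioned norms might look like the sketch below. The class names, methods, and placeholder norm strings are invented for illustration; they are not taken from the benchmark code or from CultureBank.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """A node in the town graph (hypothetical schema)."""
    name: str
    norms: list = field(default_factory=list)     # norms active at this place
    neighbors: set = field(default_factory=set)   # names of adjacent locations


class Town:
    """Undirected graph of locations; agents move along edges."""

    def __init__(self):
        self.locations = {}

    def add_location(self, name, norms=()):
        self.locations[name] = Location(name, list(norms))

    def connect(self, a, b):
        self.locations[a].neighbors.add(b)
        self.locations[b].neighbors.add(a)

    def can_move(self, src, dst):
        return dst in self.locations[src].neighbors

    def active_norms(self, name):
        """Norms a verifier would check for actions taken at this location."""
        return list(self.locations[name].norms)


# Placeholder norms, written for this example only.
town = Town()
town.add_location("home")
town.add_location("cafe", norms=["greet staff before ordering"])
town.add_location("library", norms=["keep phone calls outside"])
town.connect("home", "cafe")
town.connect("cafe", "library")
```

Attaching norms to nodes means the same action, such as a phone call, can be acceptable at one location and a violation at another.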
Is Agentic
true
Architectures
- LLM-based agent policies (decoder-only LLMs)
Collaboration
- supporting agents that exert social pressure and dialogue
Optimization Features
Infra Optimization
- the paper recommends reporting compute budgets, since the large number of simulation runs is computationally costly
Inference Optimization
- vLLM for inference with open-source models (described in the paper)
Reproducibility
Data Urls
- CultureBank (cited as the source of norms); no URL is given in the text
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Coverage is limited to the cultures and norms available in CultureBank and the chosen census marginals.
- The LLM-based verifier, even with conformal sampling, cannot fully replace human judgment in ambiguous cultural cases.
- Simulation simplifies culture as location-conditioned rules; this can reify stereotypes or miss intra-cultural variation.
When Not To Use
- As a definitive measure of real people’s cultural correctness or as legal/HR evidence.
- When you cannot afford human oversight for flagged low-confidence cases.
- To optimize agents for social manipulation or exploitation of cultural norms.
Failure Modes
- Verifier bias or sampling sensitivity leading to false positives/negatives on norm violations.
- Reifying simplified cultural rules and reinforcing stereotypes.
- Models prioritizing efficiency and task completion over safe, norm-respecting behavior.
Core Entities
Models
- Gemini-2.5-Pro
- Gemini-2.5-Flash
- Gemini-3-Pro
- Qwen3-8B
- Qwen3-14B
- Qwen3-32B
- Ministral-3-8B-Reasoning
- Ministral-3-14B-Reasoning
- Llama-3-8B
- Llama-3-70B
Metrics
- Goal Completion
- Norm Adherence
- Faithfulness to Profile
- Contextual Awareness
- Coherence
Datasets
- CultureBank (cited as source of cultural norms)
- Australian Census marginals (used for profile sampling)
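Profile sampling from census marginals can be illustrated as follows. This sketch assumes each attribute is drawn independently from its own marginal distribution, which ignores correlations between attributes and may differ from the paper's procedure. The attribute names and proportions are placeholders, not census figures.

```python
import random


def sample_profile(marginals, rng):
    """Draw one value per attribute according to its marginal distribution.

    marginals: dict mapping attribute name -> {value: proportion}.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    profile = {}
    for attribute, distribution in marginals.items():
        values = list(distribution)
        weights = [distribution[v] for v in values]
        profile[attribute] = rng.choices(values, weights=weights, k=1)[0]
    return profile


# Placeholder marginals for illustration only.
example_marginals = {
    "age_group": {"18-34": 0.3, "35-64": 0.5, "65+": 0.2},
    "household": {"single": 0.25, "couple": 0.35, "family": 0.40},
}

profile = sample_profile(example_marginals, random.Random(0))
```

Seeding the generator makes the sampled population repeatable across benchmark runs.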
Benchmarks
- CDEval
- NormAd

