Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
13
Why It Matters For Business
AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). This reveals operational gaps that static benchmarks do not show and speeds up prototyping of agent-based products.
Summary TLDR
AgentSims is an open, easy-to-use sandbox for evaluating large language models (LLMs) by putting them inside simulated towns as agents. It offers a graphical 'User Mode' for non-experts and a modular 'Developer Mode' for swapping support systems (planning, memory, tool-use). The platform favors task-based pass/fail metrics over text-similarity scores and aims to reduce benchmark leakage and judge bias. Demo available at https://agentsims.com. No quantitative evaluation of models is reported in the paper.
Problem Statement
Current LLM benchmarks are narrow (mostly single-turn QA), prone to leaking into training data, and reliant on subjective or model-based graders. We need a reproducible, extensible way to test broad, interactive skills such as long-term planning, social adaptation, and tool use.
Main Contribution
A hybrid GUI + programmatic sandbox (AgentSims) to create multi-agent task scenarios for LLM evaluation.
A modular agent architecture that separates LLM core from three support systems: Planning, Memory, and Tool-Use.
Two interaction modes: User Mode for low-barrier experiment design and Developer Mode for swapping or implementing support systems.
Open demo and implementation notes (Python backend, Unity WebGL frontend) to lower the barrier for cross-disciplinary use.
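The separation of the LLM core from its three support systems can be pictured with a short sketch. This is an illustration under stated assumptions, not the AgentSims codebase: the names `Planner`, `Memory`, `ToolUser`, `Agent`, and the stub implementations are all hypothetical, and the "LLM" is a plain function that echoes the step it is given.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Planner(Protocol):
    def plan(self, goal: str) -> List[str]: ...


class Memory(Protocol):
    def store(self, text: str) -> None: ...
    def recall(self, query: str) -> List[str]: ...


class ToolUser(Protocol):
    def act(self, step: str) -> str: ...


@dataclass
class Agent:
    """Hypothetical agent: an LLM core plus three swappable support systems."""
    llm: Callable[[str], str]  # prompt in, text out
    planner: Planner
    memory: Memory
    tools: ToolUser

    def run(self, goal: str) -> List[str]:
        outcomes = []
        for step in self.planner.plan(goal):
            context = self.memory.recall(step)
            decision = self.llm(f"Context: {context}\nStep: {step}")
            outcome = self.tools.act(decision)
            self.memory.store(outcome)
            outcomes.append(outcome)
        return outcomes


# Trivial stand-ins so the sketch runs without a real model.
class SplitPlanner:
    def plan(self, goal):
        return [part.strip() for part in goal.split(",")]


class ListMemory:
    def __init__(self):
        self.items = []

    def store(self, text):
        self.items.append(text)

    def recall(self, query):
        return self.items[-2:]  # ignores the query, returns the two latest items


class EchoTools:
    def act(self, step):
        return f"done: {step}"


def echo_llm(prompt):
    # Stand-in for a model call: returns the text after the last "Step: ".
    return prompt.rsplit("Step: ", 1)[-1]


agent = Agent(llm=echo_llm, planner=SplitPlanner(), memory=ListMemory(), tools=EchoTools())
outcomes = agent.run("buy flour, bake bread")
print(outcomes)
```

Because `Agent` depends only on the three small interfaces, any one support system can be replaced without touching the others, which is the property the modular design is meant to provide.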
Key Findings
The authors argue that task-based evaluation is harder to game, covers a broader range of abilities, and yields an objective pass rate; the paper reports no quantitative results to support this.
AgentSims exposes three pluggable support systems: Planning, Memory (vector DB embeddings), and Tool-Use.
AgentSims provides two user paths: a pixel-style GUI for non-experts and a code-first Developer Mode with simple class hooks.
Existing broad benchmarks such as BIG-bench include over 200 tasks yet still miss interactive, long-term behaviors.
AgentSims cannot fully reflect real-world complexity and struggles to measure fine-grained skills like math.
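A task pass rate of the kind described above can be computed as the fraction of simulation runs whose final state satisfies a task-specific check. The sketch below assumes a final state is a plain dictionary; the names `pass_rate` and `bread_delivered` and the state key `bread_at_cafe` are invented for illustration, and the run data is made up.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Check = Callable[[State], bool]


def pass_rate(final_states: List[State], check: Check) -> float:
    """Fraction of runs whose final state passes the task check."""
    if not final_states:
        raise ValueError("need at least one run")
    passed = sum(1 for state in final_states if check(state))
    return passed / len(final_states)


def bread_delivered(state: State) -> bool:
    # Hypothetical success condition: at least one loaf reached the cafe.
    return state.get("bread_at_cafe", 0) >= 1


# Made-up final states from four runs of the same task.
runs = [{"bread_at_cafe": 1}, {"bread_at_cafe": 0}, {"bread_at_cafe": 2}, {}]
rate = pass_rate(runs, bread_delivered)
print(rate)  # 0.5
```

The check inspects the outcome rather than the generated text, which is why this style of metric needs no text-similarity score or model-based grader.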
Who Should Care
What To Try In 7 Days
Run the demo at https://agentsims.com and explore existing scenarios.
Create a simple agent in User Mode to test dialogue and persistence.
Swap built-in memory or planning modules in Developer Mode and compare pass rates for a 3-step task.
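For the module-swap experiment, a small comparison harness is enough to get started. The sketch below is a toy with no real model: the 3-step task, the three memory classes, and the function names are hypothetical, and the "agent" answers by looking up recalled text. With a real LLM the outcome would vary between trials, which is why the harness averages over several.

```python
# A 3-step task: learn a fact, get distracted, then answer a question about it.
STEPS = [
    ("observe", "the door code is 4312"),
    ("observe", "the weather is sunny"),
    ("ask", "door code"),
]


class NoMemory:
    def store(self, text):
        pass

    def recall(self):
        return []


class LastOnlyMemory:
    def __init__(self):
        self.items = []

    def store(self, text):
        self.items = [text]  # keeps only the most recent observation

    def recall(self):
        return list(self.items)


class FullMemory:
    def __init__(self):
        self.items = []

    def store(self, text):
        self.items.append(text)

    def recall(self):
        return list(self.items)


def run_task(memory) -> bool:
    """Run the 3-step task once and report pass/fail."""
    answer = "unknown"
    for kind, body in STEPS:
        if kind == "observe":
            memory.store(body)
        else:
            # Stand-in for the LLM: answer from any recalled item on the topic.
            hits = [item for item in memory.recall() if body in item]
            answer = hits[-1] if hits else "unknown"
    return "4312" in answer


def compare(factories, trials=3):
    """Pass rate per memory module, using a fresh module for every trial."""
    return {
        name: sum(run_task(make()) for _ in range(trials)) / trials
        for name, make in factories.items()
    }


results = compare({"none": NoMemory, "last_only": LastOnlyMemory, "full": FullMemory})
print(results)
```

Only the module under test changes between configurations, so a difference in pass rate can be attributed to that module.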
Agent Features
Memory
- vector DB embeddings
- retrieval for context and relationship recall
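Embedding-based retrieval can be sketched without any external service. The example below assumes a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database; `VectorMemory`, `embed`, and `cosine` are hypothetical names, not AgentSims APIs.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self._entries.append((embed(text), text))

    def recall(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._entries, key=lambda entry: cosine(q, entry[0]), reverse=True)
        return [text for _, text in ranked[:k]]


memory = VectorMemory()
memory.store("Alice is the mayor and likes coffee")
memory.store("Bob works at the bakery")
memory.store("The park closes at night")
top = memory.recall("who works at the bakery", k=1)
print(top)
```

The recalled text would then be placed into the agent's prompt as context, for example to remind it who another agent is before a conversation.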
Planning
- pluggable prompt-based planning
- goal decomposition into subtasks
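Prompt-based goal decomposition amounts to asking the model for a numbered list and parsing the reply. The sketch below assumes a prompt template and reply format chosen for illustration; `PLAN_PROMPT`, `parse_subtasks`, `decompose`, and the canned `fake_llm` reply are hypothetical.

```python
import re
from typing import Callable, List

PLAN_PROMPT = (
    "You are {name}. Break the goal into short numbered subtasks.\n"
    "Goal: {goal}\n"
    "Subtasks:"
)


def parse_subtasks(reply: str) -> List[str]:
    """Keep lines that look like '1. text' or '2) text'; ignore everything else."""
    subtasks = []
    for line in reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            subtasks.append(match.group(1).strip())
    return subtasks


def decompose(goal: str, name: str, llm: Callable[[str], str]) -> List[str]:
    prompt = PLAN_PROMPT.format(name=name, goal=goal)
    return parse_subtasks(llm(prompt))


def fake_llm(prompt: str) -> str:
    # Canned reply standing in for a real model call.
    return (
        "1. Buy flour at the market\n"
        "2) Bake bread\n"
        "Some chatter\n"
        "3. Deliver bread to the cafe"
    )


plan = decompose("Supply the cafe with bread", "Baker", fake_llm)
print(plan)
```

Swapping the planning system then means replacing the template or the parser while keeping the same `decompose` signature.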
Tool Use
- equipment-operation pairs
- skill storage from operation feedback
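One way to read the two items above: an agent tries an operation on a piece of equipment, the simulator returns feedback, and operations that worked are stored as skills for later reuse. The sketch below assumes that reading; the `VALID` rule table and the `SkillStore` class are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical simulator rules: which (equipment, operation) pairs succeed.
VALID: Dict[Tuple[str, str], str] = {
    ("oven", "bake"): "bread is baked",
    ("register", "sell"): "item sold",
}


class SkillStore:
    """Remembers equipment-operation pairs that produced positive feedback."""

    def __init__(self) -> None:
        self.skills: Dict[Tuple[str, str], str] = {}

    def try_operation(self, equipment: str, operation: str) -> Optional[str]:
        feedback = VALID.get((equipment, operation))
        if feedback is not None:
            # Store the pair only when the operation worked.
            self.skills[(equipment, operation)] = feedback
        return feedback

    def known_operations(self, equipment: str) -> List[str]:
        return sorted(op for (eq, op) in self.skills if eq == equipment)


store = SkillStore()
failed = store.try_operation("oven", "sell")    # not a valid pair, nothing stored
worked = store.try_operation("oven", "bake")    # valid pair, stored as a skill
print(store.known_operations("oven"))
```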
Frameworks
- LLMCaller abstraction for model calls
- JSON-configured buildings and equipment
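The two framework pieces can be illustrated together. The JSON schema below is an assumption made for this sketch, not the project's actual configuration format, and `LLMCallerSketch` is a hypothetical wrapper that only borrows the idea of a single entry point for model calls; it is not the project's `LLMCaller` class.

```python
import json
from typing import Callable, Dict, List

# Assumed schema: buildings contain equipment, equipment lists its operations.
CONFIG_TEXT = """
{
  "buildings": [
    {"name": "bakery", "equipment": [{"name": "oven", "operations": ["bake"]}]},
    {"name": "cafe", "equipment": [{"name": "register", "operations": ["sell", "refund"]}]}
  ]
}
"""


def load_operations(config_text: str) -> Dict[str, List[str]]:
    """Map 'building/equipment' to its list of allowed operations."""
    config = json.loads(config_text)
    table: Dict[str, List[str]] = {}
    for building in config["buildings"]:
        for item in building["equipment"]:
            table[f'{building["name"]}/{item["name"]}'] = list(item["operations"])
    return table


class LLMCallerSketch:
    """Routes prompts to a named backend so agent code never calls a model directly."""

    def __init__(self, backends: Dict[str, Callable[[str], str]]) -> None:
        self._backends = backends

    def ask(self, model: str, prompt: str) -> str:
        if model not in self._backends:
            raise KeyError(f"unknown model: {model}")
        return self._backends[model](prompt)


operations = load_operations(CONFIG_TEXT)
caller = LLMCallerSketch({"stub-model": lambda prompt: f"echo: {prompt}"})
reply = caller.ask("stub-model", "What can the oven do?")
print(operations)
print(reply)
```

Keeping the world description in data and the model behind one wrapper means a new scenario or a different model can be tried without changing agent logic.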
Is Agentic
true
Architectures
- multi-agent sandbox
- generative agents (LLM-driven)
Collaboration
- multi-agent interaction and emergent behavior observation
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Simulation fidelity depends on LLM accuracy, and the simulation cannot fully mirror real-world complexity.
- Task pass/fail rates do not explain why models succeed or fail.
- Not suitable for fine-grained skills like precise arithmetic or formal logic.
When Not To Use
- When you need exact, low-level measurements (math, symbolic reasoning).
- When you need real-world deployment fidelity free of LLM-induced artifacts.
- When you need benchmark results with widely accepted numeric baselines.
Failure Modes
- Agents may hallucinate or behave inconsistently because of LLM limitations.
- Emergent behaviors vary with support-system choices, hurting reproducibility.
- Benchmarks can still be biased if scenario design inadvertently matches training data.
Core Entities
Models
- GPT-4
- ChatGPT
Metrics
- task pass rate
Benchmarks
- BIG-bench
Context Entities
Models
- GPT-4
- GPT-like LLMs
Metrics
- pass rate
- human ratings
- LLM-as-rater
Benchmarks
- BIG-bench

