AgentSims: a visual, multi-agent sandbox to build task-based LLM benchmarks quickly

August 8, 2023 (6 min)

Overview

Decision Snapshot: Needs Validation

The system is a practical, modular sandbox useful for prototyping and small studies, but the paper provides no quantitative experiments comparing models.

Citations: 13

Evidence Strength: 0.40

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 50%

Authors

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, Qin Chen

Links

Abstract / PDF / Code

Why It Matters For Business

AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). This exposes operational gaps that static benchmarks miss and speeds up prototyping of agent-based products.

Who Should Care

Summary TLDR

AgentSims is an open, easy-to-use sandbox for evaluating large language models (LLMs) by putting them inside simulated towns as agents. It offers a graphical 'User Mode' for non-experts and a modular 'Developer Mode' for swapping support systems (planning, memory, tool-use). The platform favors task-based pass/fail metrics over text-similarity scores and aims to reduce benchmark leakage and judge bias. Demo available at https://agentsims.com. No quantitative evaluation of models is reported in the paper.

Problem Statement

Current LLM benchmarks are narrow (mostly single-turn QA), easy to leak into training data, and rely on subjective or model-based graders. We need a reproducible, extensible way to test broad, interactive skills such as long-term planning, social adaptation and tool use.

Main Contribution

A hybrid GUI + programmatic sandbox (AgentSims) to create multi-agent task scenarios for LLM evaluation.

A modular agent architecture that separates LLM core from three support systems: Planning, Memory, and Tool-Use.

Key Findings

The paper argues that task-based evaluation is harder to game, covers a broader range of abilities, and yields an objective pass rate; it does not back this with quantitative experiments.

Practical Use: Prefer task pass/fail scenarios when you need objective, end-to-end evidence of an LLM's real-world-style skills.

Evidence Ref: Abstract; Introduction (Section 1)
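A task pass rate of the kind described in this finding can be sketched in a few lines. This is a minimal illustration, not AgentSims code: the `Task` class, the `pass_rate` function, the flat `WorldState` dictionary, and the three example tasks are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a flat snapshot of the sandbox at the end of an episode.
WorldState = Dict[str, object]


@dataclass
class Task:
    name: str
    check: Callable[[WorldState], bool]  # True means the task passed


def pass_rate(tasks: List[Task], final_state: WorldState) -> float:
    """Fraction of tasks whose success check holds on the final state."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for t in tasks if t.check(final_state))
    return passed / len(tasks)


# Made-up final state and tasks, for illustration only.
state = {"funds_raised": 1200, "hired_staff": 2, "shop_open": False}

tasks = [
    Task("raise at least 1000", lambda s: s["funds_raised"] >= 1000),
    Task("hire two employees", lambda s: s["hired_staff"] >= 2),
    Task("open the shop", lambda s: s["shop_open"] is True),
]

rate = pass_rate(tasks, state)
print(f"pass rate: {rate:.2f}")
```

Because each check is a plain predicate over the final state, no human or model grader is involved in scoring, which is the property the finding relies on.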

AgentSims exposes three pluggable support systems: Planning, Memory (vector DB embeddings), and Tool-Use.

Practical Use: Use the sandbox to measure how swapping memory or planning modules changes agent behavior.

Evidence Ref: Section 3.1 (Generative Agents)
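One way to picture the three pluggable support systems is as interfaces that an agent receives at construction time. The sketch below is an assumption about the general shape, not the AgentSims API: `Planner`, `Memory`, `ToolUser`, `Agent`, and the stub implementations are hypothetical, and the LLM call is replaced by trivial string handling so the example stays self-contained.

```python
from typing import List, Protocol


class Planner(Protocol):
    def plan(self, goal: str) -> List[str]: ...


class Memory(Protocol):
    def store(self, text: str) -> None: ...
    def recall(self, query: str) -> List[str]: ...


class ToolUser(Protocol):
    def operate(self, equipment: str, operation: str) -> str: ...


class SplitPlanner:
    """Stub planner: splits a goal on ' then ' instead of prompting an LLM."""

    def plan(self, goal: str) -> List[str]:
        return [part.strip() for part in goal.split(" then ")]


class ListMemory:
    """Stub memory: substring search over a plain list."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def store(self, text: str) -> None:
        self.items.append(text)

    def recall(self, query: str) -> List[str]:
        return [t for t in self.items if query.lower() in t.lower()]


class EchoTools:
    """Stub tool system: always reports success."""

    def operate(self, equipment: str, operation: str) -> str:
        return f"{operation} on {equipment}: ok"


class Agent:
    def __init__(self, planner: Planner, memory: Memory, tools: ToolUser) -> None:
        self.planner, self.memory, self.tools = planner, memory, tools

    def run(self, goal: str, equipment: str) -> List[str]:
        # Plan the goal, execute each step, and store the feedback.
        for step in self.planner.plan(goal):
            self.memory.store(self.tools.operate(equipment, step))
        return self.memory.recall("ok")


agent = Agent(SplitPlanner(), ListMemory(), EchoTools())
log = agent.run("preheat then bake bread", "oven")
print(log)
```

Any of the three stubs can be replaced by another class with the same methods without touching `Agent`, which is the kind of swap the Practical Use note describes.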

What To Try In 7 Days

Run the demo at https://agentsims.com and explore existing scenarios.

Create a simple agent in User Mode to test dialogue and persistence.

Swap built-in memory or planning modules in Developer Mode and compare pass rates for a 3-step task.
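The module-swap comparison in the last step could look roughly like the following. It is a toy harness under stated assumptions: `FullMemory`, `WindowMemory`, and `run_three_step_task` are hypothetical stand-ins, no LLM is called, and the 3-step task (learn a fact, get distracted, recall the fact) is invented for the example.

```python
from collections import deque
from typing import Callable, Iterable, List


class FullMemory:
    """Keeps every stored item."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def store(self, text: str) -> None:
        self.items.append(text)

    def recall(self, keyword: str) -> List[str]:
        return [t for t in self.items if keyword in t]


class WindowMemory(FullMemory):
    """Keeps only the most recent `size` items."""

    def __init__(self, size: int = 3) -> None:
        self.items = deque(maxlen=size)  # old items fall off the left


def run_three_step_task(memory: FullMemory, distractions: int) -> bool:
    # Step 1: the agent learns a fact.
    memory.store("door code is 4217")
    # Step 2: unrelated events fill the memory.
    for i in range(distractions):
        memory.store(f"small talk {i}")
    # Step 3: the task passes only if the fact can still be recalled.
    return any("4217" in t for t in memory.recall("door code"))


def pass_rate(make_memory: Callable[[], FullMemory], trials: Iterable[int]) -> float:
    results = [run_three_step_task(make_memory(), d) for d in trials]
    return sum(results) / len(results)


trials = range(10)  # 0 to 9 distracting events
rates = {
    "full": pass_rate(FullMemory, trials),
    "window-3": pass_rate(lambda: WindowMemory(3), trials),
}
print(rates)
```

Holding the task and trials fixed while changing only the memory factory keeps the comparison attributable to the swapped module.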

Agent Features

Memory
vector DB embeddings; retrieval for context and relationship recall
Planning
pluggable prompt-based planning; goal decomposition into subtasks
Tool Use
equipment-operation pairs; skill storage from operation feedback
Frameworks
LLMCaller abstraction for model calls; JSON-configured buildings and equipment
Is Agentic
Yes
Architectures
multi-agent sandbox; generative agents (LLM-driven)
Collaboration
multi-agent interaction and emergent behavior observation
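The Memory feature above (embeddings stored for later retrieval) follows a common pattern: embed each memory, embed the query, and return the nearest entries. The sketch below illustrates that pattern only. `VectorMemory` and `embed` are hypothetical, and a bag-of-words count stands in for a real embedding model and vector database, which this summary does not specify.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    def __init__(self) -> None:
        self.entries: List[Tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self.entries.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        # Rank stored memories by cosine similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = VectorMemory()
memory.store("Alice runs the bakery on Main Street")
memory.store("Bob is the mayor and dislikes noise")
memory.store("The cafe closes at nine")

top = memory.retrieve("who is the mayor", k=1)
print(top)
```

The retrieved text would then be placed into the agent's prompt as context, for example to recall a relationship before a conversation.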

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Simulation fidelity depends on LLM accuracy; it cannot fully mirror real-world complexity.

Task pass/fail rates do not explain why models succeed or fail.

When Not To Use

When you need exact, low-level measurements (math, symbolic reasoning).

When real-world deployment fidelity is required without LLM-imposed artifacts.

Failure Modes

Agents hallucinate or give inconsistent behaviors due to LLM limits.

Emergent behaviors vary with support-system choices, hurting reproducibility.

Core Entities

Models

GPT-4, ChatGPT

Metrics

task pass rate

Benchmarks

BIG-bench

Context Entities

Models

GPT-4, GPT-like LLMs

Metrics

pass rate, human ratings, LLM-as-rater

Benchmarks

BIG-bench