AgentSims: a visual, multi-agent sandbox to build task-based LLM benchmarks quickly

August 8, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

13

Authors

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, Qin Chen

Links

Abstract / PDF

Why It Matters For Business

AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). This reveals operational gaps that static benchmarks do not show and speeds up prototyping of agent-based products.

Summary TLDR

AgentSims is an open, easy-to-use sandbox for evaluating large language models (LLMs) by putting them inside simulated towns as agents. It offers a graphical 'User Mode' for non-experts and a modular 'Developer Mode' for swapping support systems (planning, memory, tool-use). The platform favors task-based pass/fail metrics over text-similarity scores and aims to reduce benchmark leakage and judge bias. Demo available at https://agentsims.com. No quantitative evaluation of models is reported in the paper.

Problem Statement

Current LLM benchmarks are narrow (mostly single-turn QA), easy to leak into training data, and rely on subjective or model-based graders. We need a reproducible, extensible way to test broad, interactive skills such as long-term planning, social adaptation and tool use.

Main Contribution

A hybrid GUI + programmatic sandbox (AgentSims) to create multi-agent task scenarios for LLM evaluation.

A modular agent architecture that separates LLM core from three support systems: Planning, Memory, and Tool-Use.

Two interaction modes: User Mode for low-barrier experiment design and Developer Mode for swapping or implementing support systems.

Open demo and implementation notes (Python backend, Unity WebGL frontend) to lower the barrier for cross-disciplinary use.
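The separation of an LLM core from pluggable support systems can be sketched roughly as follows. This is a minimal illustration, not AgentSims' actual API: the names `Agent`, `Planner`, `Memory`, `SplitPlanner`, `ListMemory` and `echo_llm` are all hypothetical, the LLM is a stub function, and the Tool-Use system is omitted for brevity (it would plug in the same way).

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Planner(Protocol):
    def plan(self, goal: str) -> List[str]: ...


class Memory(Protocol):
    def store(self, text: str) -> None: ...
    def recall(self, query: str) -> List[str]: ...


@dataclass
class Agent:
    """Hypothetical agent: an LLM core plus swappable support systems."""
    llm: Callable[[str], str]  # stand-in for a real model call
    planner: Planner
    memory: Memory

    def run(self, goal: str) -> List[str]:
        replies = []
        for subtask in self.planner.plan(goal):
            context = self.memory.recall(subtask)
            reply = self.llm(f"context={context} subtask={subtask}")
            self.memory.store(reply)
            replies.append(reply)
        return replies


class SplitPlanner:
    """Toy planner: splits a goal on ' then '."""
    def plan(self, goal: str) -> List[str]:
        return [part.strip() for part in goal.split(" then ")]


class ListMemory:
    """Toy memory: stores everything, recalls the two most recent entries."""
    def __init__(self) -> None:
        self.items: List[str] = []

    def store(self, text: str) -> None:
        self.items.append(text)

    def recall(self, query: str) -> List[str]:
        return self.items[-2:]


def echo_llm(prompt: str) -> str:
    """Stub LLM: echoes the subtask instead of calling a model."""
    return f"done: {prompt.split('subtask=')[1]}"


agent = Agent(llm=echo_llm, planner=SplitPlanner(), memory=ListMemory())
result = agent.run("buy flour then bake bread")
```

Because `Agent` depends only on the `Planner` and `Memory` protocols, a different planner or memory class can be passed in without touching the agent itself, which is the kind of swap Developer Mode is described as supporting.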

Key Findings

The authors argue that task-based evaluation is harder to game, covers a broader range of abilities, and yields an objective pass rate.

AgentSims exposes three pluggable support systems: Planning, Memory (vector DB embeddings), and Tool-Use.

AgentSims provides two user paths: a pixel-style GUI for non-experts and a code-first Developer Mode with simple class hooks.

Even broad existing benchmarks, with over 200 tasks, still miss interactive, long-term behaviors.

Numbers: BIG-bench >200 tasks

AgentSims cannot fully reflect real-world complexity and struggles to measure fine-grained skills like math.

Who Should Care

What To Try In 7 Days

Run the demo at https://agentsims.com and explore existing scenarios.

Create a simple agent in User Mode to test dialogue and persistence.

Swap built-in memory or planning modules in Developer Mode and compare pass rates for a 3-step task.
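For the last step, a pass-rate comparison might look like the sketch below. It assumes each trial ends in a simple dictionary of world-state flags and that a task passes only when all three step checks hold. The task, the flag names, the configuration labels (`memory_a`, `memory_b`) and the trial states are invented for illustration; they are not results from the paper.

```python
from typing import Callable, Dict, List

# A check inspects the final world state of one trial and returns pass/fail.
Check = Callable[[Dict[str, bool]], bool]

# Hypothetical 3-step task: get flour, turn on the oven, bake bread.
CHECKS: List[Check] = [
    lambda s: s.get("has_flour", False),
    lambda s: s.get("oven_on", False),
    lambda s: s.get("bread_baked", False),
]


def task_passed(state: Dict[str, bool], checks: List[Check]) -> bool:
    """A trial passes only if every step check succeeds."""
    return all(check(state) for check in checks)


def pass_rate(states: List[Dict[str, bool]], checks: List[Check]) -> float:
    """Fraction of trials in which the whole task passed."""
    if not states:
        raise ValueError("need at least one trial")
    return sum(task_passed(s, checks) for s in states) / len(states)


# Invented end states for two module configurations (illustrative only).
done = {"has_flour": True, "oven_on": True, "bread_baked": True}
stuck = {"has_flour": True, "oven_on": False, "bread_baked": False}

trials = {
    "memory_a": [done, done, done, stuck],
    "memory_b": [done, stuck, stuck, stuck],
}
rates = {name: pass_rate(states, CHECKS) for name, states in trials.items()}
```

Keeping the checks as plain functions over the end state makes the metric independent of how the agent phrased anything, which is the point of preferring pass/fail over text similarity.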

Agent Features

Memory

  • vector DB embeddings
  • retrieval for context and relationship recall
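Embedding-based recall can be illustrated with a toy example. The sketch below swaps in a bag-of-words vector and an in-memory list in place of learned embeddings and a vector database, so it only demonstrates the retrieval idea (rank stored memories by cosine similarity to a query). The class and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """In-memory stand-in for a vector store."""
    def __init__(self) -> None:
        self.entries: List[Tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self.entries.append((text, embed(text)))

    def recall(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = ToyMemory()
memory.store("Alice is the mayor of the town")
memory.store("Bob works at the bakery")
memory.store("Alice and Bob are friends")
top = memory.recall("who works at the bakery", k=1)
```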

Planning

  • pluggable prompt-based planning
  • goal decomposition into subtasks
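One plausible shape for prompt-based goal decomposition is: format a prompt, call the model, and parse a numbered list out of the reply. The prompt wording, the `decompose` function and the `fake_llm` stub below are assumptions made for this sketch, not the prompts or code used by AgentSims.

```python
import re
from typing import Callable, List

# Hypothetical planning prompt.
PLAN_PROMPT = (
    "Break the goal into short numbered subtasks, one per line.\n"
    "Goal: {goal}\n"
)


def decompose(goal: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for a numbered plan and keep only the numbered lines."""
    reply = llm(PLAN_PROMPT.format(goal=goal))
    subtasks = []
    for line in reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            subtasks.append(match.group(1).strip())
    return subtasks


def fake_llm(prompt: str) -> str:
    """Stub reply with some chatter and mixed numbering styles."""
    return "Sure:\n1. Find the bakery\n2) Buy bread\n\n3. Walk home"


plan = decompose("get bread for dinner", fake_llm)
```

Parsing defensively (ignoring lines that are not numbered) matters because model replies often include preamble text around the list.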

Tool Use

  • equipment-operation pairs
  • skill storage from operation feedback
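The idea of learning skills from operation feedback can be sketched as a small store keyed by (equipment, operation) pairs. Everything here is hypothetical: the `EQUIPMENT` table, the feedback strings and the `SkillStore` class are illustrative stand-ins, and in the real sandbox the feedback would come from the simulated environment rather than a lookup.

```python
from typing import Dict, Set, Tuple

# Hypothetical environment: which operations each piece of equipment supports.
EQUIPMENT: Dict[str, Set[str]] = {
    "oven": {"bake"},
    "register": {"charge"},
}


class SkillStore:
    """Remembers equipment-operation pairs that produced positive feedback."""
    def __init__(self) -> None:
        self.skills: Dict[Tuple[str, str], str] = {}

    def try_operation(self, equipment: str, operation: str) -> str:
        ok = operation in EQUIPMENT.get(equipment, set())
        feedback = "success" if ok else "failed: unsupported operation"
        if ok:
            # Only successful attempts are stored as reusable skills.
            self.skills[(equipment, operation)] = feedback
        return feedback

    def knows(self, equipment: str, operation: str) -> bool:
        return (equipment, operation) in self.skills


store = SkillStore()
first = store.try_operation("oven", "charge")   # wrong operation for an oven
second = store.try_operation("oven", "bake")    # valid pair, stored as a skill
```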

Frameworks

  • LLMCaller abstraction for model calls
  • JSON-configured buildings and equipment
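A JSON description of buildings and equipment could be loaded along these lines. The schema shown (`buildings`, `equipment`, `operations`) is an assumption made for this sketch; the source does not specify the actual configuration format.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical configuration; the real AgentSims schema may differ.
CONFIG = """
{
  "buildings": [
    {"name": "bakery",
     "equipment": [{"name": "oven", "operations": ["bake"]}]},
    {"name": "town hall", "equipment": []}
  ]
}
"""


def load_equipment(raw: str) -> Dict[Tuple[str, str], List[str]]:
    """Map (building, equipment) pairs to their allowed operations."""
    data = json.loads(raw)
    table: Dict[Tuple[str, str], List[str]] = {}
    for building in data["buildings"]:
        for item in building["equipment"]:
            table[(building["name"], item["name"])] = list(item["operations"])
    return table


equipment = load_equipment(CONFIG)
```

Keeping the world definition in data rather than code is what lets non-programmers add a building or a tool without editing agent logic.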

Is Agentic

true

Architectures

  • multi-agent sandbox
  • generative agents (LLM-driven)

Collaboration

  • multi-agent interaction and emergent behavior observation

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Simulation fidelity depends on LLM accuracy; it cannot fully mirror real-world complexity.
  • Task pass/fail rates do not explain why models succeed or fail.
  • Not suitable for fine-grained skills like precise arithmetic or formal logic.

When Not To Use

  • When you need exact, low-level measurements (math, symbolic reasoning).
  • When real-world deployment fidelity is required without LLM-imposed artifacts.
  • When you need benchmark results with widely accepted numeric baselines.

Failure Modes

  • Agents hallucinate or behave inconsistently because of LLM limitations.
  • Emergent behaviors vary with support-system choices, hurting reproducibility.
  • Benchmarks can still be biased if scenario design inadvertently matches training data.

Core Entities

Models

  • GPT-4
  • ChatGPT

Metrics

  • task pass rate

Benchmarks

  • BIG-bench

Context Entities

Models

  • GPT-4
  • GPT-like LLMs

Metrics

  • pass rate
  • human ratings
  • LLM-as-rater

Benchmarks

  • BIG-bench