CityEQA-EC benchmark plus PMA: a hierarchical LLM agent that explores simulated cities to answer open‑vocabulary questions

Overview

Decision Snapshot: Needs Validation

The paper provides a clear benchmark and a working hierarchical agent with ablations. Results are solid in simulation but depend on external LLM/VLM APIs, and the benchmark has limited coverage of dynamic events.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 65%

Authors

Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CityEQA-EC and PMA provide a practical testbed for building drone/UAV perception and urban-inspection agents that use language-guided planning and map memory, reducing search time and distance vs naive exploration.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist, CTO

Summary TLDR

This paper introduces CityEQA-EC, the first open-ended embodied question answering benchmark for realistic city scenes (1,412 validated tasks). It also proposes PMA, a Planner–Manager–Actor agent that uses LLMs for planning, a Vision-Language Model and GroundSAM for perception, and an object-centric 2D cognitive map for memory. On a 200-task test, PMA scores QAA 3.00±1.96 (60.7% of human EQA accuracy) while cutting navigation error and time steps vs standard exploration baselines. The dataset, code, and ablations show the map, navigator, explorer, and collector modules matter most. PMA still lags humans on visual reasoning and ignores dynamic/social events.

Problem Statement

Embodied Question Answering (EQA) has focused on indoor scenes. City environments are larger, visually ambiguous, and have view-dependent observations. We need agents that plan long-horizon exploration, use landmarks and spatial relations, and convert visual inputs into accurate open-vocabulary answers.

Main Contribution

CityEQA-EC: a validated benchmark of 1,412 open‑vocabulary EQA tasks in a realistic 3D city simulator.

PMA: a hierarchical Planner–Manager–Actor agent using LLMs for planning, GroundSAM/VLM for perception, and an object-centric 2D cognitive map for long-term memory.

Key Findings

CityEQA-EC contains 1,412 validated tasks across six task types.

Numbers: 1,412 tasks (final dataset)

Practical Use: You can benchmark urban embodied agents on diverse, human-validated city questions rather than small indoor sets.

Evidence Ref: Section 2, Dataset validation

PMA achieves QAA 3.00±1.96 vs human H-EQA 4.94±0.21, equal to 60.73% of human accuracy.

Numbers: PMA QAA 3.00±1.96; H-EQA 4.94±0.21; 60.73%

Practical Use: Hierarchical planning plus map memory improves answering, but visual reasoning still needs work to match humans.

Evidence Ref: Table 2, Section 4.2
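The 60.73% figure is consistent with a simple ratio of the two mean QAA scores. The short check below is a sketch under that assumption (the variable names are illustrative, and the paper may define the percentage differently):

```python
# Mean QAA scores as reported in this summary (1-5 scale).
pma_qaa = 3.00    # PMA agent
human_qaa = 4.94  # H-EQA human reference

# Assumption: "percent of human accuracy" is the ratio of the two means.
ratio = pma_qaa / human_qaa
print(f"PMA reaches {ratio:.2%} of the human QAA score")
```

Note that the ratio compares means only; the much larger spread for PMA (±1.96 vs ±0.21) is not reflected in it.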

Results

Metric	Value	System	Delta	Split / Dataset	Evidence	Evidence Ref
QAA (1-5)	3.00±1.96	PMA (ours)	vs H-EQA 4.94±0.21 (60.73% human)	CityEQA-EC (200-task eval sample)	Table 2; Section 4.2	Table 2
QAA (1-5)	4.94±0.21	H-EQA (human)	—	CityEQA-EC	Table 2; Section 4.2	Table 2

What To Try In 7 Days

Download CityEQA-EC and run PMA on a small set to inspect map outputs and trajectories.

Replace the VLM (GPT-4o) with your VLM to compare visual recognition quality quickly.

Run the PMA ablation without the map to see how persistent memory affects efficiency.

Agent Features

Memory

Object-centric cognitive map (2D grids, merged over time)

Req_info and History memory modules
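To make the idea of an object-centric 2D map that is merged over time more concrete, here is a minimal sketch. It assumes each detected object is stored as a label plus the set of grid cells it occupies, and that a new observation is merged into an existing object when the labels match and the cells overlap. The class names `MapObject` and `CognitiveMap`, the `min_overlap` parameter, and the overlap rule itself are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field

Cell = tuple[int, int]  # (row, col) index into the 2D grid


@dataclass
class MapObject:
    label: str
    cells: set[Cell] = field(default_factory=set)


class CognitiveMap:
    """Hypothetical object-centric map: one entry per object, grown over time."""

    def __init__(self, min_overlap: int = 1):
        self.objects: list[MapObject] = []
        self.min_overlap = min_overlap  # shared cells required to merge

    def add_observation(self, label: str, cells: set[Cell]) -> MapObject:
        cells = set(cells)
        for obj in self.objects:
            # Assumed merge rule: same label and enough overlapping cells.
            if obj.label == label and len(obj.cells & cells) >= self.min_overlap:
                obj.cells |= cells
                return obj
        # No match found: register a new object.
        new_obj = MapObject(label, cells)
        self.objects.append(new_obj)
        return new_obj
```

A rule this simple also shows how merging can go wrong: two adjacent buildings with the same label and touching footprints would be fused into one object.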

Planning

LLM-driven Planner using few-shot Chain-of-Thought

LoRA

Tool Use

VLM for visual Q&A and action selection (Collector)

GroundSAM for grounding/segmentation

A* for path planning
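For readers unfamiliar with A*, the sketch below shows the standard algorithm on a small 2D occupancy grid. It is a generic illustration, not the agent's navigator: the function name `astar`, the 0/1 grid encoding, 4-connected moves, unit step costs, and the Manhattan heuristic are all assumptions made for this example.

```python
import heapq


def astar(grid, start, goal):
    """Shortest path on a grid (0 = free, 1 = blocked), or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk parent links back to the start, then reverse.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > best_g.get(current, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    came_from[nxt] = current
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None
```

In a setup like the one described here, the free/blocked grid would presumably be derived from the cognitive map, with the goal cell chosen near a landmark.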

Frameworks

EmbodiedCity (Unreal Engine 4 + AirSim), GroundSAM, GPT-4o/GPT-4 as VLM/LLM

Is Agentic

Yes

Architectures

Planner–Manager–Actor hierarchy, Object-centric 2D grid cognitive map

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/CityEQA-3027

Data URLs

https://anonymous.4open.science/r/CityEQA-3027

Risks & Boundaries

Limitations

Focuses on object-centric, static tasks; dynamic events and social interactions not covered.

Evaluation uses a 200-task sampled subset due to API limits, not the full dataset.

When Not To Use

When the task requires temporal reasoning or detection of dynamic events (traffic jams, crowds).

If you need a fully open-source stack and cannot use closed LLM/VLM APIs.

Failure Modes

Map merging errors causing landmark misidentification and wrong navigation targets.

Collector overadjustment that degrades image quality and lowers QAA after many steps.

Core Entities

Models

GPT-4o, GPT-4, Qwen-2.5, LLaMA-v3.1-8b, DeepSeek-v3, GroundSAM, LLaVA-style VLMs (GPT-4o used as VLM in experiments)

Metrics

Accuracy, Navigation Error (NE), Mean Time Step (MTS)

Datasets

CityEQA-EC, EmbodiedCity (simulator)

Benchmarks

CityEQA-EC

Context Entities

Models

OpenEQA baselines (FBE), LoRA

Datasets

City-3DQA, EarthVQA, Open3DVQA (related outdoor QA datasets)
