CityEQA-EC benchmark plus PMA: a hierarchical LLM agent that explores simulated cities to answer open‑vocabulary questions

February 18, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides a clear benchmark and a working hierarchical agent with ablations; results are solid in simulation but rely on external LLM/VLM APIs and limited dynamic-event coverage.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 65%

Authors

Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CityEQA-EC and PMA provide a practical testbed for building drone/UAV perception and urban-inspection agents that use language-guided planning and map memory, reducing search time and distance vs naive exploration.

Summary TLDR

This paper introduces CityEQA-EC, the first open-ended embodied question answering benchmark for realistic city scenes (1,412 validated tasks). It also proposes PMA, a Planner–Manager–Actor agent that uses LLMs for planning, a Vision-Language Model and GroundSAM for perception, and an object-centric 2D cognitive map for memory. On a 200-task evaluation sample, PMA scores QAA 3.00±1.96, which is 60.73% of the human H-EQA score (4.94±0.21), while reducing navigation error and time steps relative to standard exploration baselines. The ablations show that the map, navigator, explorer, and collector modules matter most. PMA still lags humans on visual reasoning and does not handle dynamic or social events.

Problem Statement

Embodied Question Answering (EQA) has focused on indoor scenes. City environments are larger, visually ambiguous, and have view-dependent observations. We need agents that plan long-horizon exploration, use landmarks and spatial relations, and convert visual inputs into accurate open-vocabulary answers.

Main Contribution

CityEQA-EC: a validated benchmark of 1,412 open‑vocabulary EQA tasks in a realistic 3D city simulator.

PMA: a hierarchical Planner–Manager–Actor agent using LLMs for planning, GroundSAM/VLM for perception, and an object-centric 2D cognitive map for long-term memory.
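The Planner–Manager–Actor split described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: every name here (`SubTask`, `AgentState`, `plan`, `manage`, `ACTORS`) is hypothetical, the Planner is hard-coded instead of calling an LLM, the question is a made-up example, and the actors only log what they would do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SubTask:
    kind: str    # "navigate", "explore", or "collect"
    target: str  # landmark, object, or question text


@dataclass
class AgentState:
    history: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def plan(question: str) -> List[SubTask]:
    # Hypothetical Planner: a real agent would prompt an LLM here.
    return [
        SubTask("navigate", "nearest landmark"),
        SubTask("explore", "target object"),
        SubTask("collect", question),
    ]


def navigate(task: SubTask, state: AgentState) -> None:
    state.history.append(f"navigate -> {task.target}")


def explore(task: SubTask, state: AgentState) -> None:
    state.history.append(f"explore -> {task.target}")


def collect(task: SubTask, state: AgentState) -> None:
    state.history.append(f"collect -> {task.target}")
    state.answer = "placeholder answer"  # stand-in for a VLM response


# Manager-side dispatch table from sub-task kind to an Actor.
ACTORS: Dict[str, Callable[[SubTask, AgentState], None]] = {
    "navigate": navigate,
    "explore": explore,
    "collect": collect,
}


def manage(question: str) -> AgentState:
    # Hypothetical Manager: runs the plan in order and keeps a history.
    state = AgentState()
    for task in plan(question):
        ACTORS[task.kind](task, state)
    return state


state = manage("What color is the car parked near the tower?")
print(state.history)
print(state.answer)
```

Keeping the dispatch table separate from the plan makes it straightforward to swap a single actor (for example, a different collector) without touching the planning step.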

Key Findings

CityEQA-EC contains 1,412 validated tasks across six task types.

Numbers: 1,412 tasks (final dataset)

Practical Use: You can benchmark urban embodied agents on diverse, human-validated city questions rather than small indoor sets.

Evidence Ref: Section 2, Dataset validation

PMA achieves QAA 3.00±1.96 vs human H-EQA 4.94±0.21, equal to 60.73% of human accuracy.

Numbers: PMA QAA 3.00±1.96; H-EQA 4.94±0.21; 60.73%

Practical Use: Hierarchical planning plus map memory improves answering, but visual reasoning still needs work to match humans.

Evidence Ref: Table 2, Section 4.2
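The 60.73% figure follows directly from the two mean scores quoted above. A quick check, using only those reported means (the variable names are illustrative):

```python
# Mean scores on the 1-5 QAA scale, as quoted in the finding above.
pma_qaa = 3.00
human_qaa = 4.94

# PMA's mean score expressed as a fraction of the human mean.
relative = pma_qaa / human_qaa
print(f"{relative:.2%}")  # prints 60.73%
```

Note that this is a ratio of means on a 1-5 scale, so it does not account for the much larger spread in PMA's scores (±1.96 vs ±0.21).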

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
QAA (1-5) | 3.00±1.96 | PMA (ours) | vs H-EQA 4.94±0.21 (60.73% human) | CityEQA-EC (200-task eval sample) | Table 2; Section 4.2 | Table 2
QAA (1-5) | 4.94±0.21 | H-EQA (human) | - | CityEQA-EC | Table 2; Section 4.2 | Table 2

What To Try In 7 Days

Download CityEQA-EC and run PMA on a small set to inspect map outputs and trajectories.

Replace the VLM (GPT-4o) with your VLM to compare visual recognition quality quickly.

Run the PMA ablation without the map to see how persistent memory affects efficiency.
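When comparing a full run against a swapped VLM or a no-map ablation, it helps to reduce per-task logs to the same summary statistics for each configuration. The sketch below assumes a hypothetical log format (one dict per task with `qaa`, `ne`, and `steps` keys); the repository's actual output format may differ, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(records: List[Dict[str, float]]) -> Dict[str, float]:
    """Summarize per-task records; needs at least two records for stdev."""
    qaa = [r["qaa"] for r in records]
    return {
        "qaa_mean": mean(qaa),
        "qaa_std": stdev(qaa),                       # sample standard deviation
        "ne_mean": mean(r["ne"] for r in records),   # Navigation Error (NE)
        "mts": mean(r["steps"] for r in records),    # Mean Time Step (MTS)
    }


# Placeholder records for two hypothetical configurations (not paper data).
runs = {
    "full": [
        {"qaa": 4, "ne": 10.0, "steps": 20},
        {"qaa": 2, "ne": 30.0, "steps": 40},
    ],
    "no_map": [
        {"qaa": 3, "ne": 20.0, "steps": 50},
        {"qaa": 1, "ne": 60.0, "steps": 70},
    ],
}

for name, records in runs.items():
    print(name, summarize(records))
```

Running both configurations on the same task subset keeps the comparison paired, which matters given the high per-task variance in QAA.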

Agent Features

Memory
Object-centric cognitive map (2D grids, merged over time); Req_info and History memory modules
Planning
LLM-driven Planner using few-shot Chain-of-Thought; LoRA
Tool Use
VLM for visual Q&A and action selection (Collector); GroundSAM for grounding/segmentation; A* for path planning
Frameworks
EmbodiedCity (Unreal Engine 4 + AirSim); GroundSAM; GPT-4o/GPT-4 as VLM/LLM
Is Agentic

Yes

Architectures
Planner–Manager–Actor hierarchy; Object-centric 2D grid cognitive map
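Since the agent lists A* for path planning over a 2D grid map, a textbook A* on a small occupancy grid shows the idea. This is a generic sketch, not the paper's navigator: the 4-connected moves, unit step cost, Manhattan heuristic, and the 0/1 grid encoding are all assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells."""
    rows, cols = len(grid), len(grid[0])

    def h(cell: Cell) -> int:
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f, g, cell)
    came_from = {}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
```

In an object-centric map, the goal cell would typically be chosen near a landmark's footprint rather than inside it, since cells occupied by the object itself are blocked.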

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses on object-centric, static tasks; dynamic events and social interactions not covered.

Evaluation uses a 200-task sampled subset due to API limits, not the full dataset.

When Not To Use

When the task requires temporal reasoning or detection of dynamic events (traffic jams, crowds).

If you need a fully open-source stack and cannot use closed LLM/VLM APIs.

Failure Modes

Map merging errors causing landmark misidentification and wrong navigation targets.

Collector overadjustment that degrades image quality and lowers QAA after many steps.
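The map-merging failure mode listed above can be illustrated with a toy object-centric map. This summary does not describe the paper's actual merge rule, so the sketch assumes one common choice: merge a new observation into an existing object when the labels match and the grid-cell overlap (IoU) passes a threshold. `MapObject`, `iou`, `merge_observation`, and the 0.3 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Cell = Tuple[int, int]


@dataclass
class MapObject:
    label: str
    cells: Set[Cell]  # grid cells covered by this object


def iou(a: Set[Cell], b: Set[Cell]) -> float:
    # Intersection over union of two cell sets; 0.0 if both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_observation(cog_map: List[MapObject], label: str,
                      cells: Set[Cell], threshold: float = 0.3) -> MapObject:
    """Merge into a same-label object with enough overlap, else add a new one."""
    for obj in cog_map:
        if obj.label == label and iou(obj.cells, cells) >= threshold:
            obj.cells |= cells
            return obj
    new_obj = MapObject(label, set(cells))
    cog_map.append(new_obj)
    return new_obj


cog_map: List[MapObject] = []
merge_observation(cog_map, "building", {(0, 0), (0, 1)})
merge_observation(cog_map, "building", {(0, 1), (0, 2)})  # overlaps: merged
merge_observation(cog_map, "building", {(5, 5)})          # disjoint: new object
print(len(cog_map))
```

Under a rule like this, a threshold that is too low fuses adjacent buildings into one landmark, while one that is too high splits a single building seen from different views into duplicates. Either error can send the navigator to the wrong target.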

Core Entities

Models

GPT-4o, GPT-4, Qwen-2.5, LLaMA-v3.1-8b, DeepSeek-v3, GroundSAM, LLaVA-style VLMs (GPT-4o used as VLM in experiments)

Metrics

Accuracy, Navigation Error (NE), Mean Time Step (MTS)

Datasets

CityEQA-EC, EmbodiedCity (simulator)

Benchmarks

CityEQA-EC

Context Entities

Models

OpenEQA baselines (FBE), LoRA

Datasets

City-3DQA, EarthVQA, Open3DVQA (related outdoor QA datasets)