CityEQA-EC benchmark plus PMA: a hierarchical LLM agent that explores simulated cities to answer open‑vocabulary questions

February 18, 2025 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.65

Cost Impact Score

0.35

Citation Count

0

Authors

Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

Links

Abstract / PDF

Why It Matters For Business

CityEQA-EC and PMA provide a practical testbed for building drone/UAV perception and urban-inspection agents that use language-guided planning and map memory. In simulation, PMA ends closer to its targets and uses fewer time steps than naive exploration.

Summary TLDR

This paper introduces CityEQA-EC, the first open-ended embodied question answering benchmark for realistic city scenes (1,412 validated tasks). It also proposes PMA, a Planner–Manager–Actor agent that uses LLMs for planning, a Vision-Language Model and GroundSAM for perception, and an object-centric 2D cognitive map for memory. On a 200-task test subset, PMA scores QAA 3.00±1.96 (60.7% of the human score of 4.94±0.21) while cutting navigation error and time steps relative to standard exploration baselines. Ablations indicate that the map, navigator, explorer, and collector modules matter most. PMA still lags humans on visual reasoning and does not handle dynamic or social events.

Problem Statement

Embodied Question Answering (EQA) has focused on indoor scenes. City environments are larger, visually ambiguous, and have view-dependent observations. We need agents that plan long-horizon exploration, use landmarks and spatial relations, and convert visual inputs into accurate open-vocabulary answers.

Main Contribution

CityEQA-EC: a validated benchmark of 1,412 open‑vocabulary EQA tasks in a realistic 3D city simulator.

PMA: a hierarchical Planner–Manager–Actor agent using LLMs for planning, GroundSAM/VLM for perception, and an object-centric 2D cognitive map for long-term memory.

Empirical evaluation and ablations showing PMA outperforms blind, VQA, Socratic, and naive exploring baselines and revealing key module contributions.

Key Findings

CityEQA-EC contains 1,412 validated tasks across six task types.

Numbers: 1,412 tasks (final dataset)

PMA achieves QAA 3.00±1.96 vs human H-EQA 4.94±0.21, equal to 60.73% of human accuracy.

Numbers: PMA QAA 3.00±1.96; H-EQA 4.94±0.21; 60.73%

PMA reduces navigation error and time steps compared to frontier/random exploring agents.

Numbers: PMA NE 46.56±36.39 m, MTS 24.44±14.39 vs FBE NE 86.92±53.71 m, MTS 39.31±32.17

Removing the object-centric map greatly harms performance.

Numbers: QAA drops from 3.00±1.96 to 2.31±1.82 when map removed
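The ratios behind the findings above can be re-derived from the reported means. The sketch below is only arithmetic on the numbers quoted in this summary; the variable and function names are illustrative, and standard deviations are ignored.

```python
# Reported means quoted in this summary (standard deviations omitted).
PMA_QAA, HUMAN_QAA = 3.00, 4.94          # answer score on a 1-5 scale
PMA_NE, FRONTIER_NE = 46.56, 86.92       # navigation error in metres
PMA_MTS, FRONTIER_MTS = 24.44, 39.31     # mean time steps
NO_MAP_QAA = 2.31                        # PMA with the map removed


def percent_of(value: float, reference: float) -> float:
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


def relative_reduction(new: float, old: float) -> float:
    """Percentage by which new is lower than old."""
    return 100.0 * (old - new) / old


human_normalized = percent_of(PMA_QAA, HUMAN_QAA)
ne_reduction = relative_reduction(PMA_NE, FRONTIER_NE)
mts_reduction = relative_reduction(PMA_MTS, FRONTIER_MTS)
map_ablation_drop = relative_reduction(NO_MAP_QAA, PMA_QAA)

print(f"PMA QAA as share of human score: {human_normalized:.2f}%")
print(f"Navigation error reduction vs frontier baseline: {ne_reduction:.2f}%")
print(f"Time step reduction vs frontier baseline: {mts_reduction:.2f}%")
print(f"QAA drop without the map: {map_ablation_drop:.2f}%")
```

Given the large standard deviations reported, these relative figures are best read as rough effect sizes rather than precise gains.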

The Collector (view fine-tuning) improves accuracy as it takes more steps, but gains plateau and can reverse slightly when the view is over-adjusted.

Numbers: QAA rises with collector steps; Step 10 slightly lower than Step 9

LLM-based automatic scoring aligns with humans (Spearman R_s = 0.85).

Numbers: R_s = 0.85, p = 0.002 on 100 samples
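For readers who want to run the same kind of agreement check between an LLM judge and human raters, here is a minimal pure-Python sketch of the Spearman rank correlation (Pearson correlation computed on average ranks). The score lists are made-up illustration data, not values from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 scores for five answers (illustration only).
human_scores = [5, 4, 1, 3, 2]
llm_scores = [5, 3, 1, 4, 2]
rho = spearman(human_scores, llm_scores)
print(f"Spearman correlation: {rho:.2f}")
```

This sketch does not compute a p-value; a statistics library would be the usual choice for that.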

Results

QAA (1-5)

Value: 3.00±1.96

Baseline: PMA (ours)

QAA (1-5)

Value: 4.94±0.21

Baseline: H-EQA (human)

Navigation Error (m)

Value: 46.56±36.39

Baseline: PMA (ours)

Mean Time Step (steps)

Value: 24.44±14.39

Baseline: PMA (ours)

QAA ablation (no map)

Value: 2.31±1.82

Baseline: PMA w/o map

Who Should Care

What To Try In 7 Days

Download CityEQA-EC and run PMA on a small set to inspect map outputs and trajectories.

Swap the VLM (GPT-4o) for your own VLM to get a quick comparison of visual recognition quality.

Run the PMA ablation without the map to see how persistent memory affects efficiency.

Agent Features

Memory

  • Object-centric cognitive map (2D grids, merged over time)
  • Req_info and History memory modules
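One way to picture an object-centric 2D grid map that is merged over time is sketched below. This is an illustrative assumption, not the authors' implementation: the class names, the 5 m cell size, and the merge rule (same label plus at least one shared grid cell) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MapObject:
    """One object instance: a label and the grid cells it occupies."""
    label: str
    cells: set = field(default_factory=set)


class CognitiveMap:
    """Toy object-centric 2D grid map (assumed design, for illustration)."""

    def __init__(self, cell_size: float = 5.0):
        self.cell_size = cell_size  # metres per grid cell (assumed value)
        self.objects = []

    def to_cell(self, x: float, y: float):
        """Convert world coordinates to integer grid indices."""
        return (int(x // self.cell_size), int(y // self.cell_size))

    def add_observation(self, label: str, points):
        """Insert a detection; merge it with same-label objects it overlaps."""
        cells = {self.to_cell(x, y) for x, y in points}
        merged = MapObject(label, set(cells))
        kept = []
        for obj in self.objects:
            if obj.label == label and obj.cells & cells:
                merged.cells |= obj.cells  # same object seen again: fuse
            else:
                kept.append(obj)
        kept.append(merged)
        self.objects = kept
        return merged


city_map = CognitiveMap(cell_size=5.0)
city_map.add_observation("building", [(1.0, 1.0), (6.0, 1.0)])
fused = city_map.add_observation("building", [(7.0, 2.0), (12.0, 2.0)])
city_map.add_observation("building", [(100.0, 100.0)])  # far away: new object
city_map.add_observation("car", [(1.0, 1.0)])           # other label: new object
print(len(city_map.objects), sorted(fused.cells))
```

A rule this simple also shows where the map-merging failure mode comes from: two distinct buildings that touch the same cell would be fused into one landmark.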

Planning

  • LLM-driven Planner using few-shot Chain-of-Thought
  • LoRA

Tool Use

  • VLM for visual Q&A and action selection (Collector)
  • GroundSAM for grounding/segmentation
  • A* for path planning
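As a reminder of what A* path planning on a 2D grid looks like, here is a minimal sketch. The 4-connected moves, unit step cost, Manhattan heuristic, and toy occupancy grid are assumptions for illustration; the summary does not specify how PMA configures its planner.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked).

    Returns a list of (row, col) cells from start to goal, or None if the
    goal is unreachable. Assumes start and goal are free cells.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if cost > best_cost[current]:
            continue  # stale heap entry; a cheaper route was already found
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = current
                    heapq.heappush(
                        open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None


# Toy occupancy grid: the middle row is blocked except for one gap.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
print(path)
```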

Frameworks

  • EmbodiedCity (Unreal Engine 4 + AirSim)
  • GroundSAM
  • GPT-4o/GPT-4 as VLM/LLM

Is Agentic

true

Architectures

  • Planner–Manager–Actor hierarchy
  • Object-centric 2D grid cognitive map
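The Planner–Manager–Actor split can be pictured as a small control loop: a planner turns the question into sub-tasks, a manager dispatches each sub-task to an actor, and results accumulate in shared memory. The sketch below uses stub functions in place of the LLM, VLM, and simulator. The function names, the fixed three-step plan, and the memory layout are hypothetical, not the authors' code.

```python
from typing import Callable, Dict, List, Tuple

SubTask = Tuple[str, str]  # (actor name, target description)


def plan(question: str) -> List[SubTask]:
    """Stand-in for the LLM planner: returns a fixed three-step plan."""
    return [
        ("navigate", "landmark mentioned in the question"),
        ("explore", "target object near the landmark"),
        ("collect", question),
    ]


class Manager:
    """Dispatches sub-tasks to actors and records results in memory."""

    def __init__(self, actors: Dict[str, Callable[[str, dict], str]]):
        self.actors = actors
        self.memory = {"history": []}

    def run(self, question: str) -> str:
        for actor_name, target in plan(question):
            result = self.actors[actor_name](target, self.memory)
            self.memory["history"].append((actor_name, target, result))
        # The last sub-task (collect) is assumed to produce the answer.
        return self.memory["history"][-1][2]


# Stub actors; real ones would drive the simulator and query a VLM.
def navigator(target: str, memory: dict) -> str:
    return f"arrived near {target}"


def explorer(target: str, memory: dict) -> str:
    return f"found {target}"


def collector(question: str, memory: dict) -> str:
    return "stub answer"


manager = Manager(
    {"navigate": navigator, "explore": explorer, "collect": collector}
)
answer = manager.run("What color is the car parked next to the tall building?")
print(answer, len(manager.memory["history"]))
```

Keeping the plan and the actors behind simple interfaces like this is what makes module ablations (for example, running without the map) straightforward to set up.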

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on object-centric, static tasks; dynamic events and social interactions not covered.
  • Evaluation uses a 200-task sampled subset due to API limits, not the full dataset.
  • Relies on external closed-source VLM/LLM (GPT-4o/GPT-4), which may limit reproducibility and raise cost.
  • Simulation-to-real transfer is not addressed; real-world conditions may reduce performance.

When Not To Use

  • When the task requires temporal reasoning or detection of dynamic events (traffic jams, crowds).
  • If you need a fully open-source stack and cannot use closed LLM/VLM APIs.
  • For fine-grained text reading or tiny visual details beyond current VLM capabilities.

Failure Modes

  • Map merging errors causing landmark misidentification and wrong navigation targets.
  • Collector overadjustment that degrades image quality and lowers QAA after many steps.
  • Error accumulation in long-horizon plans driven by incorrect LLM parsing or plan steps.
  • LLM judge bias or scoring mismatches on corner-case open-vocab answers.

Core Entities

Models

  • GPT-4o
  • GPT-4
  • Qwen-2.5
  • LLaMA-v3.1-8b
  • DeepSeek-v3
  • GroundSAM
  • LLaVA-style VLMs (GPT-4o used as VLM in experiments)

Metrics

  • Accuracy
  • Navigation Error (NE)
  • Mean Time Step (MTS)
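A minimal sketch of how NE and MTS could be aggregated over episodes follows. It assumes NE is the Euclidean distance between the agent's final position and the target position, and that the ± figures are population standard deviations; neither detail is confirmed by this summary. The episode records are made-up illustration data.

```python
import math
from statistics import mean, pstdev


def navigation_error(final_xy, target_xy) -> float:
    """Euclidean distance in metres between final and target positions."""
    return math.dist(final_xy, target_xy)


# Hypothetical episode logs (not results from the paper).
episodes = [
    {"final": (0.0, 0.0), "target": (3.0, 4.0), "steps": 10},
    {"final": (10.0, 10.0), "target": (10.0, 25.0), "steps": 20},
]

errors = [navigation_error(e["final"], e["target"]) for e in episodes]
steps = [e["steps"] for e in episodes]

ne_mean, ne_std = mean(errors), pstdev(errors)
mts_mean, mts_std = mean(steps), pstdev(steps)

print(f"NE  {ne_mean:.2f}±{ne_std:.2f} m")
print(f"MTS {mts_mean:.2f}±{mts_std:.2f} steps")
```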

Datasets

  • CityEQA-EC
  • EmbodiedCity (simulator)

Benchmarks

  • CityEQA-EC

Context Entities

Models

  • OpenEQA baselines (FBE)
  • LoRA

Datasets

  • City-3DQA, EarthVQA, Open3DVQA (related outdoor QA datasets)