Overview
The paper provides a clear benchmark and a working hierarchical agent with ablations. Results are solid in simulation, but the agent depends on external LLM/VLM APIs and the benchmark does not cover dynamic events.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 35%
Production readiness: 30%
Novelty: 65%
Why It Matters For Business
CityEQA-EC and PMA provide a practical testbed for building UAV perception and urban-inspection agents that use language-guided planning and map memory, reducing search time and distance compared with naive exploration.
Who Should Care
Summary TLDR
This paper introduces CityEQA-EC, the first open-ended embodied question answering benchmark for realistic city scenes (1,412 validated tasks). It also proposes PMA, a Planner–Manager–Actor agent that uses LLMs for planning, a Vision-Language Model and GroundSAM for perception, and an object-centric 2D cognitive map for memory. On a 200-task evaluation sample, PMA scores QAA 3.00±1.96 (60.73% of the human H-EQA score of 4.94±0.21) while cutting navigation error and time steps compared with standard exploration baselines. The dataset and code are available, and ablations show that the map, navigator, explorer, and collector modules matter most. PMA still lags humans on visual reasoning and does not handle dynamic or social events.
Problem Statement
Embodied Question Answering (EQA) has focused on indoor scenes. City environments are larger, visually ambiguous, and have view-dependent observations. We need agents that plan long-horizon exploration, use landmarks and spatial relations, and convert visual inputs into accurate open-vocabulary answers.
Main Contribution
CityEQA-EC: a validated benchmark of 1,412 open‑vocabulary EQA tasks in a realistic 3D city simulator.
PMA: a hierarchical Planner–Manager–Actor agent using LLMs for planning, GroundSAM/VLM for perception, and an object-centric 2D cognitive map for long-term memory.
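To make the idea of an object-centric 2D cognitive map more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class names (`MapObject`, `CognitiveMap`), the `merge_radius` threshold, and the running-average merge rule are all assumptions chosen for illustration. It only shows how detections could be stored per object and merged when the same label is seen again nearby.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class MapObject:
    label: str              # open-vocabulary label from the perception module
    x: float                # 2D map coordinates (units are an assumption)
    y: float
    observations: int = 1   # how many detections were merged into this object


@dataclass
class CognitiveMap:
    merge_radius: float = 5.0   # hypothetical threshold, not from the paper
    objects: list[MapObject] = field(default_factory=list)

    def add(self, label: str, x: float, y: float) -> MapObject:
        """Insert a detection, merging it into a nearby object with the same label."""
        for obj in self.objects:
            close = hypot(obj.x - x, obj.y - y) <= self.merge_radius
            if obj.label == label and close:
                n = obj.observations
                # Running average keeps the stored position stable across views.
                obj.x = (obj.x * n + x) / (n + 1)
                obj.y = (obj.y * n + y) / (n + 1)
                obj.observations = n + 1
                return obj
        obj = MapObject(label, x, y)
        self.objects.append(obj)
        return obj

    def find(self, label: str) -> list[MapObject]:
        """Return every stored object with the given label."""
        return [o for o in self.objects if o.label == label]


cmap = CognitiveMap()
cmap.add("red building", 10.0, 20.0)
cmap.add("red building", 12.0, 20.0)   # within merge_radius, so it merges
cmap.add("bus stop", 40.0, 5.0)
```

A distance threshold like this is a simple design choice: set too large, it can merge two distinct landmarks into one; set too small, it duplicates the same landmark seen from different views.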
Key Findings
CityEQA-EC contains 1,412 validated tasks across six task types.
PMA achieves QAA 3.00±1.96 vs human H-EQA 4.94±0.21, equal to 60.73% of human accuracy.
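The 60.73% figure follows directly from the two mean scores quoted above. The short check below recomputes it, along with the absolute gap; the variable names are only illustrative.

```python
pma_qaa = 3.00     # PMA mean QAA on the 200-task sample
human_qaa = 4.94   # human H-EQA mean QAA

ratio = pma_qaa / human_qaa   # PMA score as a fraction of the human score
gap = human_qaa - pma_qaa     # absolute gap on the 1-5 scale

print(f"PMA reaches {ratio:.2%} of the human score")
print(f"Absolute gap: {gap:.2f} points")
```

Note that the ratio compares mean scores only; it says nothing about the much larger spread in PMA's results (±1.96 vs ±0.21).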
Results
| Metric | System | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|---|
| QAA (1-5) | PMA (ours) | 3.00±1.96 | H-EQA (human) 4.94±0.21 | -1.94 (60.73% of human) | CityEQA-EC (200-task eval sample) | Table 2; Section 4.2 | Table 2 |
| QAA (1-5) | H-EQA (human) | 4.94±0.21 | n/a | n/a | CityEQA-EC | Table 2; Section 4.2 | Table 2 |
What To Try In 7 Days
Download CityEQA-EC and run PMA on a small set to inspect map outputs and trajectories.
Replace the VLM (GPT-4o) with your VLM to compare visual recognition quality quickly.
Run the PMA ablation without the map to see how persistent memory affects efficiency.
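For the VLM swap, one option is to put a thin interface between the agent and the model so that different backends can be compared on the same images and questions. The sketch below is hypothetical throughout: `VisionLanguageModel`, `StubVLM`, and `compare_vlms` are illustrative names, not part of the PMA codebase, and the stub returns canned text so the script runs without any API access.

```python
from typing import Protocol


class VisionLanguageModel(Protocol):
    """Hypothetical interface: anything with this method can be plugged in."""

    def answer(self, image_path: str, question: str) -> str: ...


class StubVLM:
    """Offline stand-in for a real model; always returns a canned answer."""

    def __init__(self, canned: str) -> None:
        self.canned = canned

    def answer(self, image_path: str, question: str) -> str:
        return self.canned


def compare_vlms(
    models: dict[str, VisionLanguageModel],
    samples: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Ask every model the same (image, question) pairs and collect the answers."""
    return {
        name: [model.answer(image, question) for image, question in samples]
        for name, model in models.items()
    }


samples = [("frame_001.png", "What color is the building next to the bus stop?")]
results = compare_vlms(
    {"baseline": StubVLM("red"), "candidate": StubVLM("dark red")},
    samples,
)
print(results)
```

In practice each real backend would get its own small wrapper class implementing `answer`, and the collected answers could then be scored with whatever QAA procedure you use.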
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Reproducibility
Risks & Boundaries
Limitations
Focuses on object-centric, static tasks; dynamic events and social interactions are not covered.
Evaluation uses a 200-task sampled subset due to API limits, not the full dataset.
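If you evaluate on your own subset, fixing the random seed keeps the sample reproducible across runs. The sketch below draws 200 task indices out of 1,412. The seed value and the assumption that tasks can be addressed by a zero-based integer index are illustrative; the paper's actual sampling procedure is not described here.

```python
import random


def sample_task_ids(total: int = 1412, k: int = 200, seed: int = 0) -> list[int]:
    """Draw k distinct task indices from range(total) with a fixed seed."""
    rng = random.Random(seed)   # local generator, leaves global state untouched
    return sorted(rng.sample(range(total), k))


task_ids = sample_task_ids()
print(len(task_ids), task_ids[:5])
```

Recording the seed alongside reported scores lets others rerun the same subset, which matters when the subset is only about 14% of the full benchmark.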
When Not To Use
When the task requires temporal reasoning or detection of dynamic events (traffic jams, crowds).
When you need a fully open-source stack and cannot use closed LLM/VLM APIs.
Failure Modes
Map merging errors causing landmark misidentification and wrong navigation targets.
Collector overadjustment that degrades image quality and lowers QAA after many steps.

