Overview
Production Readiness
0.3
Novelty Score
0.65
Cost Impact Score
0.35
Citation Count
0
Why It Matters For Business
CityEQA-EC and PMA provide a practical testbed for building drone/UAV perception and urban-inspection agents. Their language-guided planning and map memory reduce search time and distance compared with naive exploration.
Summary TLDR
This paper introduces CityEQA-EC, the first open-ended embodied question answering benchmark for realistic city scenes (1,412 validated tasks). It also proposes PMA, a Planner–Manager–Actor agent that uses LLMs for planning, a Vision-Language Model and GroundSAM for perception, and an object-centric 2D cognitive map for memory. On a 200-task test subset, PMA scores QAA 3.00±1.96 (60.7% of human EQA accuracy) while cutting navigation error and time steps compared with standard exploration baselines. The dataset, code, and ablations show the map, navigator, explorer, and collector modules matter most. PMA still lags humans on visual reasoning and ignores dynamic/social events.
Problem Statement
Embodied Question Answering (EQA) has focused on indoor scenes. City environments are larger and visually ambiguous, and what the agent observes depends on its viewpoint. We need agents that plan long-horizon exploration, use landmarks and spatial relations, and convert visual inputs into accurate open-vocabulary answers.
Main Contribution
CityEQA-EC: a validated benchmark of 1,412 open‑vocabulary EQA tasks in a realistic 3D city simulator.
PMA: a hierarchical Planner–Manager–Actor agent using LLMs for planning, GroundSAM/VLM for perception, and an object-centric 2D cognitive map for long-term memory.
Empirical evaluation and ablations showing that PMA outperforms blind, VQA, Socratic, and naive exploration baselines, and revealing which modules contribute most.
Key Findings
CityEQA-EC contains 1,412 validated tasks across six task types.
PMA achieves QAA 3.00±1.96 vs human H-EQA 4.94±0.21, equal to 60.73% of human accuracy.
PMA reduces navigation error and time steps compared to frontier-based and random exploration agents.
Removing the object-centric map greatly harms performance.
The Collector (view fine-tuning) improves accuracy as adjustment steps increase, but gains can plateau or slightly reverse when the view is over-adjusted.
LLM-based automatic scoring aligns with humans (Spearman R_s = 0.85).
Results
Metrics reported: QAA (1-5), Navigation Error (m), Mean Time Step (steps), QAA ablation (no map)
Who Should Care
What To Try In 7 Days
Download CityEQA-EC and run PMA on a small set to inspect map outputs and trajectories.
Replace the VLM (GPT-4o) with your VLM to compare visual recognition quality quickly.
Run the PMA ablation without the map to see how persistent memory affects efficiency.
Agent Features
Memory
- Object-centric cognitive map (2D grids, merged over time)
- Req_info and History memory modules
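To make the idea of an object-centric grid map that is merged over time concrete, here is a minimal sketch. It is an illustration under stated assumptions, not PMA's implementation: the class `ObjectMap`, the 5 m cell size, and the merge rule (same label plus at least one shared cell) are all hypothetical.

```python
CELL_SIZE_M = 5.0  # hypothetical grid resolution in metres


def to_cell(x: float, y: float) -> tuple:
    """Map world coordinates (metres) to integer grid indices."""
    return (int(x // CELL_SIZE_M), int(y // CELL_SIZE_M))


class ObjectMap:
    """Toy object-centric 2D map: each object is a label plus occupied cells."""

    def __init__(self):
        self.objects = []  # list of {"label": str, "cells": set of (i, j)}

    def add_observation(self, label, points):
        """Merge a detection into a matching object, or store it as a new one.

        Assumed merge rule: same label and at least one overlapping cell.
        """
        cells = {to_cell(x, y) for x, y in points}
        for obj in self.objects:
            if obj["label"] == label and obj["cells"] & cells:
                obj["cells"] |= cells
                return obj
        obj = {"label": label, "cells": cells}
        self.objects.append(obj)
        return obj


city_map = ObjectMap()
city_map.add_observation("building", [(1, 1), (6, 1)])
city_map.add_observation("building", [(7, 2), (12, 2)])  # overlaps, so merged
print(len(city_map.objects))  # 1
```

A rule this simple would mis-merge adjacent objects with the same label, which is the kind of map merging error listed among the failure modes.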
Planning
- LLM-driven Planner using few-shot Chain-of-Thought
- LoRA
Tool Use
- VLM for visual Q&A and action selection (Collector)
- GroundSAM for grounding/segmentation
- A* for path planning
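A* path planning over a 2D occupancy grid can be sketched as follows. This is a generic textbook version, assuming 4-connected moves with unit cost and a Manhattan-distance heuristic; the grid encoding (0 free, 1 blocked) and the function name `astar` are illustrative choices, not details taken from PMA.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
```

In the example the wall in the middle row forces the path around the right-hand column, so the route visits seven cells.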
Frameworks
- EmbodiedCity (Unreal Engine 4 + AirSim)
- GroundSAM
- GPT-4o/GPT-4 as VLM/LLM
Is Agentic
true
Architectures
- Planner–Manager–Actor hierarchy
- Object-centric 2D grid cognitive map
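The control flow of a Planner–Manager–Actor hierarchy can be pictured with a small skeleton. Everything here is a hypothetical stand-in: the `SubTask` type, the three sub-task kinds, and the hand-written plan replace what in PMA is produced by an LLM and executed by the navigator, explorer, and collector modules.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    kind: str    # assumed kinds: "navigate", "explore", or "collect"
    target: str  # landmark or object name


def plan(question: str) -> list:
    """Stand-in for the LLM Planner: returns a fixed, hand-written plan."""
    return [
        SubTask("navigate", "landmark"),
        SubTask("explore", "target object"),
        SubTask("collect", "target object"),
    ]


# Stand-in actors; each returns a log line instead of moving a real agent.
ACTORS = {
    "navigate": lambda target: f"moved to {target}",
    "explore": lambda target: f"searched for {target}",
    "collect": lambda target: f"adjusted view of {target}",
}


def manage(question: str) -> list:
    """Manager loop: run each sub-task with the matching actor, in order."""
    log = []
    for task in plan(question):
        log.append(ACTORS[task.kind](task.target))
    return log


print(manage("What color is the car parked next to the landmark?"))
```

Separating the plan from the actors means a failed sub-task could be retried or replanned without touching the other modules, although this sketch omits any such error handling.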
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on object-centric, static tasks; dynamic events and social interactions not covered.
- Evaluation uses a 200-task sampled subset due to API limits, not the full dataset.
- Relies on external closed-source VLM/LLM (GPT-4o/GPT-4), which may limit reproducibility and increase cost.
- Simulation-to-real transfer is not addressed; real-world conditions may reduce performance.
When Not To Use
- When the task requires temporal reasoning or detection of dynamic events (traffic jams, crowds).
- If you need a fully open-source stack and cannot use closed LLM/VLM APIs.
- For fine-grained text reading or tiny visual details beyond current VLM capabilities.
Failure Modes
- Map merging errors causing landmark misidentification and wrong navigation targets.
- Collector overadjustment that degrades image quality and lowers QAA after many steps.
- Error accumulation in long-horizon plans driven by incorrect LLM parsing or plan steps.
- LLM judge bias or scoring mismatches on corner-case open-vocab answers.
Core Entities
Models
- GPT-4o
- GPT-4
- Qwen-2.5
- LLaMA-v3.1-8b
- DeepSeek-v3
- GroundSAM
- LLaVA-style VLMs (GPT-4o used as VLM in experiments)
Metrics
- Accuracy
- Navigation Error (NE)
- Mean Time Step (MTS)
Datasets
- CityEQA-EC
- EmbodiedCity (simulator)
Benchmarks
- CityEQA-EC
Context Entities
Models
- OpenEQA baselines (FBE)
- LoRA
Datasets
- City-3DQA, EarthVQA, Open3DVQA (related outdoor QA datasets)

