LLM agent that perceives landmarks, stores memories, and plans to navigate cities without step-by-step instructions

August 8, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

2

Authors

Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li

Why It Matters For Business

PReP shows that an autonomous navigation agent can operate without explicit step-by-step instructions and with far less training data than RL requires, which speeds up prototyping of navigation assistants, accessibility tools, and search-and-rescue systems.

Summary TLDR

This paper presents PReP, an LLM-driven agentic workflow for reaching goal locations in city street graphs when the goal is described only relative to landmarks. It has three components: a fine-tuned LLaVA vision model that estimates landmark direction and distance, a memory module (episodic, semantic, and working memory) that builds a cognitive map, and an LLM planner that decomposes routes into sub-goals. On four central business district (CBD) datasets (Beijing, Shanghai, New York, Paris), PReP reaches an average success rate of about 54%, well above reactive LLM prompting and several other baselines. Compared with RL, the method gains data efficiency at the cost of LLM API usage and per-step latency (~12 s), and it works best when landmarks are visible.

Problem Statement

Given only street-view images and a textual goal described relative to known landmarks (no step-by-step instructions and no map), build an agent that self-localizes, infers the goal's direction and distance, forms an internal map from experience, and plans multi-step routes to the goal in complex city road graphs.

Main Contribution

PReP workflow: Perceive (vision LLM), Reflect (episodic + semantic + working memory), Plan (LLM decomposes into sub-goals).

Fine-tuned LLaVA-7B perception with LoRA for landmark detection and rough distance/direction estimation.

Natural-language long-term memory with retrieval, plus an anticipate-evaluate working memory, for robust inference when landmarks are not visible.

Evaluation on four real-CBD city datasets and ablations showing large gains over reactive prompting and several LLM-based baselines.
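The Perceive, Reflect, Plan workflow listed above can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the authors' code: the grid world, the `perceive`, `reflect`, and `plan` functions, and the `Observation` and `Memory` types are all hypothetical stand-ins. In the paper, perception is a fine-tuned LLaVA model and planning is done by an LLM.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Observation:
    direction: Optional[str]  # coarse direction to the goal, None once arrived

@dataclass
class Memory:
    episodic: List[str] = field(default_factory=list)  # step-by-step text log
    semantic: str = ""                                  # running summary

# Hypothetical toy world: positions are (x, y) grid cells.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def perceive(position: Tuple[int, int], goal: Tuple[int, int]) -> Observation:
    """Stand-in for the vision model: report a coarse direction to the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    if dx == 0 and dy == 0:
        return Observation(None)
    if abs(dx) >= abs(dy):
        return Observation("east" if dx > 0 else "west")
    return Observation("north" if dy > 0 else "south")

def reflect(memory: Memory, step: int, position, obs: Observation) -> Memory:
    """Append a natural-language episodic entry and refresh the summary."""
    memory.episodic.append(f"step {step}: at {position}, goal seems {obs.direction}")
    memory.semantic = f"last known goal direction: {obs.direction}"
    return memory

def plan(memory: Memory, obs: Observation) -> str:
    """Stand-in for the LLM planner; here it just follows the perceived direction.
    A real planner would also read `memory` to form sub-goals."""
    return obs.direction

def navigate(start, goal, max_steps: int = 50):
    position, memory = start, Memory()
    for step in range(max_steps):
        obs = perceive(position, goal)
        if obs.direction is None:          # arrived at the goal
            return True, step, memory
        memory = reflect(memory, step, position, obs)
        dx, dy = MOVES[plan(memory, obs)]
        position = (position[0] + dx, position[1] + dy)
    return False, max_steps, memory
```

Calling `navigate((0, 0), (2, 1))` walks the toy agent to the goal and returns the success flag, the number of steps taken, and the accumulated memory.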

Key Findings

PReP substantially improves success rate over reactive and other LLM prompting baselines.

Numbers: Average SR ≈ 54% across four city test sets

Fine-tuned LLaVA perception is nearly as good as ground-truth perception for navigation.

Numbers: Fine-tuned LLaVA SR ≈ 4% lower than oracle SR on average

Ablations show both reflection (memory) and planning matter; removing them drops performance.

Numbers: Beijing SR: PReP 66% vs React 41% (+25 percentage points)

Fine-tuning massively improves landmark detection metrics.

Numbers: LLaVA-FT accuracy 0.998 vs LLaVA-base 0.1873

Results

Success Rate (Beijing)

Value: 66% (PReP)

Baseline: React 41%

Success Rate (Shanghai)

Value: 51% (PReP)

Baseline: React 25%

Perception gap vs oracle (avg SR)

Value: Fine-tuned LLaVA SR ≈ 4% lower than oracle SR

Baseline: Oracle (GPS-based) perception

Landmark detection (accuracy and IoU)

Value: LLaVA-FT accuracy 0.998, IoU 0.9152

Baseline: LLaVA-base accuracy 0.1873, IoU 0.6432

Who Should Care

What To Try In 7 Days

Fine-tune a vision LLM (LLaVA) with LoRA on a few thousand landmark images for your area.

Log agent steps as short natural-language episodic memories and test simple retrieval + LLM summarization.

Prototype a planner prompt that decomposes target direction into 2–4 sub-goals and compare against reactive prompting.
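For the planner-prompt experiment, a minimal sketch of the two prompt variants might look like the following. The wording of both prompts and the function names `reactive_prompt` and `planner_prompt` are assumptions for illustration; the paper's actual prompts may differ, and no LLM call is made here.

```python
def reactive_prompt(goal_text, observation, options):
    """Baseline: ask for the next move directly, with no memory or plan."""
    return (
        f"Goal: {goal_text}\n"
        f"Current view: {observation}\n"
        f"Available roads: {', '.join(options)}\n"
        "Reply with the single road to take next."
    )

def planner_prompt(goal_text, observation, memory_summary, options,
                   min_subgoals=2, max_subgoals=4):
    """Planning variant: ask for a short sub-goal list before choosing a road."""
    return (
        f"Goal: {goal_text}\n"
        f"Current view: {observation}\n"
        f"What you remember: {memory_summary}\n"
        f"Available roads: {', '.join(options)}\n"
        f"First, break the route into {min_subgoals} to {max_subgoals} sub-goals, "
        "one per line, each phrased as a direction and a rough distance.\n"
        "Then reply with the single road that best serves the first sub-goal."
    )
```

Sending both prompts for the same task set and comparing success rates gives a quick read on whether the sub-goal decomposition helps in your environment.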

Agent Features

Memory

  • episodic memory (step-by-step text logs)
  • semantic memory (summaries learned from episodes)
  • working memory with anticipate-evaluate retrieval
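A minimal sketch of the episodic and semantic parts of such a memory, assuming plain word-overlap retrieval and a count-based summary. Both are simplifications chosen for illustration: the paper uses LLM-based summarization, and the anticipate-evaluate working memory is not modeled here. The class name `NavigationMemory` and its methods are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric tokens, used for naive overlap scoring."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NavigationMemory:
    def __init__(self):
        self.episodic = []  # list of (step, natural-language entry)

    def log(self, step, text):
        self.episodic.append((step, text))

    def retrieve(self, query, k=2):
        """Return up to k entries sharing the most words with the query,
        preferring more recent steps on ties."""
        q = set(tokens(query))
        scored = [(len(q & set(tokens(text))), step, text)
                  for step, text in self.episodic]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: (-s[0], -s[1]))
        return [text for _, _, text in scored[:k]]

    def summarize(self, landmarks):
        """Crude stand-in for semantic memory: count landmark sightings."""
        counts = Counter()
        for _, text in self.episodic:
            for name in landmarks:
                if name.lower() in text.lower():
                    counts[name] += 1
        total = len(self.episodic)
        return "; ".join(f"{name}: seen in {counts[name]} of {total} steps"
                         for name in landmarks)
```

In a fuller prototype, `retrieve` would typically be replaced by embedding search and `summarize` by an LLM call over the retrieved entries.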

Planning

  • long-term plan decomposition into sub-goals
  • short-term action selection from road connections
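Short-term action selection from road connections can be illustrated with a purely geometric rule: pick the outgoing road whose heading is closest to the bearing of the current sub-goal. This is an assumed simplification (in PReP the choice is made by the LLM), and the function names and the degrees-clockwise-from-north convention are hypothetical.

```python
def angular_gap(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def choose_heading(target_bearing, road_headings):
    """Pick the available road heading closest to the sub-goal bearing."""
    if not road_headings:
        raise ValueError("no outgoing roads at this node")
    return min(road_headings, key=lambda h: angular_gap(h, target_bearing))
```

For example, with roads leaving at 0, 90, 180, and 270 degrees and a sub-goal bearing of 350 degrees, the rule selects the road at 0 degrees, because the wrap-around gap is only 10 degrees.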

Tool Use

  • LoRA
  • LLM API calls (GPT-4-turbo etc.)

Frameworks

  • prompt-based LLM orchestration
  • LLaMA-Factory (finetune pipeline)

Is Agentic

true

Architectures

  • multimodal LLM (LLaVA-7B)
  • LLM planner (GPT-4-turbo / LLaMA3-8B fine-tuned)

Optimization Features

Infra Optimization

  • A single NVIDIA A100 was used for fine-tuning; inference runs on a GPU plus LLM API calls

Model Optimization

  • LoRA

Training Optimization

  • Small annotated dataset (≈8k images) to fine-tune perception quickly

Inference Optimization

  • None targeted; per-step latency ~12s reported
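To put the reported per-step latency in context, a back-of-the-envelope estimate of wall-clock time per episode is sketched below. The ~12 s figure comes from the summary above; the step count in the usage note is a hypothetical input, not a number from the paper.

```python
PER_STEP_LATENCY_S = 12.0  # approximate per-step latency reported above

def episode_minutes(num_steps, per_step_s=PER_STEP_LATENCY_S):
    """Estimated wall-clock minutes for one navigation episode."""
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return num_steps * per_step_s / 60.0
```

Under these assumptions, a 30-step episode would take about 6 minutes, which is why the workflow suits offline evaluation better than real-time control.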

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on powerful closed-source LLMs (GPT-4-turbo) for top results.
  • Test sets are small (100 tasks per city), so reported SR may shift on larger sets.
  • Method relies on visible, identifiable landmarks; performance degrades when landmarks are scarce or invisible.
  • Per-step latency (~12s) and API use may be too slow/expensive for real-time robotic deployment.
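The effect of the small test sets can be quantified with a confidence interval on the success rate. The sketch below uses the Wilson score interval and treats each task as an independent success/failure trial, which is an assumption rather than something the paper states.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 66 successes out of 100 tasks, matching the Beijing SR reported above.
low, high = wilson_interval(66, 100)
```

With 66 successes in 100 tasks, the interval spans roughly 56% to 75%, so differences of a few points between methods on a single city should be read with caution.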

When Not To Use

  • When low-latency, real-time control is required (per-step ~12s).
  • In environments with few or ambiguous landmarks.
  • When strict safety certification is required for autonomous vehicles.

Failure Modes

  • Repeating loops or circling when memory retrieval or planning fails.
  • Getting stuck in dead ends when perception infers a goal direction but the road graph has no route that way.
  • Perception misdetections (false positives) lead to wrong goal inferences.
  • Dependence on closed-source LLM behavior and prompt sensitivity.
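The looping failure mode can be caught with a simple guard on the agent's trajectory. This is a hypothetical monitoring heuristic, not part of PReP as described: it flags an episode when any node is revisited too often within a recent window, and the `window` and `max_visits` defaults are arbitrary.

```python
from collections import Counter

def is_circling(visited_nodes, window=8, max_visits=2):
    """Return True if any node appears more than max_visits times
    among the last `window` visited nodes."""
    recent = visited_nodes[-window:]
    counts = Counter(recent)
    return any(c > max_visits for c in counts.values())
```

When the guard fires, a prototype could force a fresh memory retrieval or a replan instead of letting the agent continue around the same block.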

Core Entities

Models

  • LLaVA-7B
  • LLaMA3-8B
  • GPT-4-turbo
  • GPT-3.5-turbo
  • GLM-4
  • Mistral-7B

Metrics

  • Success Rate (SR)
  • Success weighted by Path Length (SPL)
  • Accuracy
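The two navigation metrics can be computed as follows. SR is the fraction of successful episodes; SPL here follows the standard embodied-navigation definition, where each success is weighted by the ratio of shortest-path length to the longer of the actual and shortest path. Whether the paper uses exactly this SPL variant is an assumption, and the `Episode` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    success: bool
    shortest_path: float  # length of the optimal route, e.g. in meters
    path_length: float    # length of the route the agent actually took

def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that reached the goal."""
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length: failed episodes contribute 0,
    successful ones contribute shortest / max(actual, shortest)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.path_length, e.shortest_path)
    return total / len(episodes)
```

An agent that always succeeds but takes twice the optimal distance would score SR = 1.0 and SPL = 0.5, which is why the two metrics are reported together.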

Datasets

  • Beijing_CBD_streetviews
  • Shanghai_CBD_streetviews
  • NewYork_CBD_streetviews
  • Paris_CBD_streetviews

Benchmarks

  • PReP four-city navigation testset (100 tasks per city)