LLM agent that perceives landmarks, stores memories, and plans to navigate cities without step-by-step instructions

August 8, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper demonstrates consistent gains in simulation, backed by ablations, but it relies on closed-source LLMs, a small test set, and simulated rather than real-world deployment.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li

Links

Abstract / PDF / Code

Why It Matters For Business

PReP shows that autonomous navigation agents can operate without explicit step-by-step instructions and with far less training data than RL approaches, enabling faster prototyping of navigation assistants, accessibility tools, and search-and-rescue systems.

Who Should Care

Summary TLDR

This paper builds PReP, an LLM-driven agentic workflow for finding goal locations in city street graphs when the goal is described only relative to landmarks. It has three components: a fine-tuned LLaVA vision model that estimates landmark direction and distance, a memory module (episodic, semantic, and working memory) that forms a cognitive map, and an LLM planner that decomposes routes into sub-goals. On four CBD datasets (Beijing, Shanghai, New York, Paris), PReP reaches an average success rate of about 54%, improving substantially over reactive LLM prompting and several other baselines. The method accepts LLM API usage and a per-step latency of about 12 s in exchange for better data efficiency than RL, and it does best when landmarks are visible.

Problem Statement

Given only street-view images and a textual goal described relative to known landmarks (no step-by-step instructions or map), make an agent that self-localizes, infers goal direction and distance, forms an internal map from experience, and plans multi-step routes to reach the goal in complex city road graphs.

Main Contribution

PReP workflow: Perceive (vision LLM), Reflect (episodic + semantic + working memory), Plan (LLM decomposes into sub-goals).

Fine-tuned LLaVA-7B perception with LoRA for landmark detection and rough distance/direction estimation.
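
The Perceive, Reflect, Plan stages can be pictured as a simple control loop. The Python below is a toy sketch under stated assumptions, not the authors' implementation: `Observation`, `Memory`, `perceive`, `plan`, and `run_episode` are hypothetical names, the perception step is a stub that always reports the landmark to the north, and the planner is a dictionary lookup standing in for an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical perception output for one street-view location."""
    landmark: str
    direction: str     # compass heading toward the landmark, e.g. "north"
    distance_m: float  # rough distance estimate in metres


@dataclass
class Memory:
    """Hypothetical memory: an episodic text log plus the latest goal heading."""
    episodic: list = field(default_factory=list)
    goal_direction: str = ""

    def reflect(self, step, node, obs):
        # Episodic memory: one short natural-language entry per step.
        self.episodic.append(
            f"step {step}: at {node}, {obs.landmark} is {obs.direction}, "
            f"about {obs.distance_m:.0f} m away"
        )
        # Stand-in for semantic summarization: remember where the goal lies.
        self.goal_direction = obs.direction


def perceive(node):
    """Stub for the vision model (assumption: the landmark is always north)."""
    return Observation(landmark="tower", direction="north", distance_m=300.0)


def plan(memory, neighbors):
    """Stub planner: follow the remembered heading if a road goes that way."""
    if memory.goal_direction in neighbors:
        return neighbors[memory.goal_direction]
    return next(iter(neighbors.values()))  # otherwise take any available road


def run_episode(graph, start, goal, max_steps=10):
    memory = Memory()
    node, path = start, [start]
    for step in range(max_steps):
        if node == goal or not graph[node]:
            break
        obs = perceive(node)             # Perceive
        memory.reflect(step, node, obs)  # Reflect
        node = plan(memory, graph[node]) # Plan
        path.append(node)
    return path


# Toy road graph: each node maps a heading to the neighbouring node.
toy_graph = {
    "A": {"north": "B", "east": "C"},
    "B": {"north": "GOAL", "south": "A"},
    "C": {"west": "A"},
    "GOAL": {},
}
toy_path = run_episode(toy_graph, start="A", goal="GOAL")
```

In a real system each stub would be replaced by a model call, and the planner would emit sub-goals rather than a single next node.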

Key Findings

PReP substantially improves success rate over reactive and other LLM prompting baselines.

Numbers: Average SR ≈ 54% across four city test sets

Practical Use: Use a perception + memory + planning workflow instead of stepwise reactive prompts to get much higher navigation success in city maps.

Evidence Ref: Abstract; Sec. 5; Table 1

Fine-tuned LLaVA perception is nearly as good as ground-truth perception for navigation.

Numbers: Fine-tuned LLaVA SR ≈ 4% lower than oracle SR on average

Practical Use: Spend modest compute to fine-tune the vision LLM (LoRA); perception quality largely closes the gap to the GPS-based oracle.

Evidence Ref: Sec. 5.3; Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Success Rate (Beijing) | 66% (PReP) | React 41% | +25% absolute | Beijing test set (100 tasks) | Table 1, Sec. 5.2 | Table 1
Success Rate (Shanghai) | 51% (PReP) | React 25% | +26% absolute | Shanghai test set (100 tasks) | Table 1, Sec. 5.2 | Table 1

What To Try In 7 Days

Fine-tune a vision LLM (LLaVA) with LoRA on a few thousand landmark images for your area.

Log agent steps as short natural-language episodic memories and test simple retrieval + LLM summarization.

Prototype a planner prompt that decomposes target direction into 2–4 sub-goals and compare against reactive prompting.
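
For the episodic-memory experiment in the list above, a first pass needs only a text log and a crude retriever. The sketch below is a minimal illustration under assumptions: `log_step`, `retrieve`, and `summary_prompt` are hypothetical helpers, retrieval is plain word overlap rather than anything the paper specifies, and the prompt wording is invented for the example.

```python
def log_step(memory, step, text):
    """Append one short natural-language episodic entry."""
    memory.append({"step": step, "text": text})


def retrieve(memory, query, k=2):
    """Return up to k entries sharing the most words with the query.

    Ties are broken in favour of more recent steps.
    """
    query_words = set(query.lower().split())
    scored = []
    for entry in memory:
        overlap = len(query_words & set(entry["text"].lower().split()))
        if overlap > 0:
            scored.append((overlap, entry["step"], entry["text"]))
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [text for _, _, text in scored[:k]]


def summary_prompt(snippets):
    """Build a (hypothetical) prompt asking an LLM to summarize retrieved memories."""
    bullet_list = "\n".join(f"- {s}" for s in snippets)
    return (
        "Summarize what these observations imply about the goal location:\n"
        + bullet_list
    )


episodic = []
log_step(episodic, 0, "saw the tower to the north about 300 m away")
log_step(episodic, 1, "turned east and reached a dead end")
log_step(episodic, 2, "tower now to the west about 100 m away")

hits = retrieve(episodic, "where is the tower", k=2)
prompt = summary_prompt(hits)
```

Swapping the word-overlap score for embedding similarity is a natural next step once the logging format has settled.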

Agent Features

Memory
episodic memory (step-by-step text logs); semantic memory (summaries learned from episodes); working memory with anticipate-evaluate retrieval
Planning
long-term plan decomposition into sub-goals; short-term action selection from road connections
Tool Use
LoRA; LLM API calls (GPT-4-turbo etc.)
Frameworks
prompt-based LLM orchestration; LLaMA-Factory (fine-tuning pipeline)
Is Agentic

Yes

Architectures
multimodal LLM (LLaVA-7B); LLM planner (GPT-4-turbo / fine-tuned LLaMA3-8B)

Optimization Features

Infra Optimization
Single NVIDIA A100 used for fine-tuning; inference done via GPU plus LLM API
Model Optimization
LoRA
Training Optimization
Small annotated dataset (≈8k images) to fine-tune perception quickly
Inference Optimization
None targeted; per-step latency ~12s reported

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Performance depends on powerful closed-source LLMs (GPT-4-turbo) for top results.

Test sets are limited (100 tasks per city), so reported SR may fluctuate on larger sets.
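
To get a feel for how much a success rate measured on 100 tasks can move, one option is a binomial confidence interval. The sketch below uses the Wilson score interval, which is a standard choice but is not something the paper reports; `wilson_interval` is a hypothetical helper, and the 66-out-of-100 input is the Beijing figure from the Results table.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Beijing: 66 successes out of 100 tasks.
low, high = wilson_interval(66, 100)
```

For 66 of 100 this works out to roughly 56% to 75%, so differences of a few points between methods on a single city should be read with caution.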

When Not To Use

When low-latency, real-time control is required (per-step ~12s).

In environments with few or ambiguous landmarks.

Failure Modes

Repeating loops or circling when memory retrieval or planning fails.

Getting stuck in dead ends when perception infers the goal direction but the road graph offers no route that way.
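
A cheap guard against the looping failure mode is to watch the recent trajectory for repeated nodes. The following sketch is an assumption for illustration, not a mechanism described in the paper: `is_circling` is a hypothetical helper, and the `window` and `max_visits` thresholds are arbitrary.

```python
from collections import Counter


def is_circling(path, window=8, max_visits=3):
    """Flag a trajectory whose last `window` nodes revisit any node too often."""
    recent = path[-window:]
    return any(count >= max_visits for count in Counter(recent).values())


looping = is_circling(["A", "B", "C", "A", "B", "C", "A"])
progressing = is_circling(["A", "B", "C", "D"])
```

When the check fires, an agent could force a re-plan or exclude the most recently used road from the next action choice.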

Core Entities

Models

LLaVA-7B, LLaMA3-8B, GPT-4-turbo, GPT-3.5-turbo, GLM-4, Mistral-7B

Metrics

Success Rate (SR), Success weighted by Path Length (SPL), Accuracy
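
SR and SPL can be computed from per-episode logs. The sketch below assumes the commonly used SPL definition (each successful episode contributes shortest-path length divided by the larger of the taken and shortest lengths, averaged over all episodes); whether the paper uses exactly this form is an assumption. The episode records and the field names `success`, `shortest`, and `taken` are made up for the example.

```python
def success_rate(episodes):
    """Fraction of episodes that reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Illustrative episodes only; path lengths are in metres.
episodes = [
    {"success": True, "shortest": 100.0, "taken": 100.0},   # optimal route
    {"success": True, "shortest": 100.0, "taken": 200.0},   # twice as long
    {"success": False, "shortest": 100.0, "taken": 400.0},  # failed
    {"success": True, "shortest": 50.0, "taken": 50.0},     # optimal route
]
sr_value = success_rate(episodes)   # 3 of 4 succeed -> 0.75
spl_value = spl(episodes)           # (1 + 0.5 + 0 + 1) / 4 -> 0.625
```

SPL can never exceed SR, so a large gap between the two indicates that successful episodes are taking long detours.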

Datasets

Beijing_CBD_streetviews, Shanghai_CBD_streetviews, NewYork_CBD_streetviews, Paris_CBD_streetviews

Benchmarks

PReP four-city navigation test set (100 tasks per city)