LLM agent that perceives landmarks, stores memories, and plans to navigate cities without step-by-step instructions

Overview

Decision Snapshot: Needs Validation

The paper reports consistent gains in simulation, supported by ablations, but relies on closed-source LLMs, a small test set, and simulation rather than real-world deployment.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li

Links

Abstract / PDF / Code

Why It Matters For Business

PReP shows you can build autonomous navigation agents that operate without explicit step-by-step instructions and with far less RL data, enabling faster prototyping for navigation assistants, accessibility tools, and search-and-rescue prototypes.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist

Summary TLDR

This paper builds PReP, an LLM-driven agentic workflow for finding goal locations in city street graphs when the goal is only described relative to landmarks. It has three components: a fine-tuned LLaVA vision model that estimates landmark direction and distance, a memory module (episodic, semantic, and working memory) that forms a cognitive map, and an LLM planner that decomposes routes into sub-goals. On four CBD datasets (Beijing, Shanghai, New York, Paris) PReP reaches ~54% average success rate, improving substantially over reactive LLM prompting and several other baselines. The method accepts LLM API usage and per-step latency (~12s) in exchange for better data efficiency than RL, and it performs best when landmarks are visible.

Problem Statement

Given only street-view images and a textual goal described relative to known landmarks (no step-by-step instructions or map), make an agent that self-localizes, infers goal direction and distance, forms an internal map from experience, and plans multi-step routes to reach the goal in complex city road graphs.

Main Contribution

PReP workflow: Perceive (vision LLM), Reflect (episodic + semantic + working memory), Plan (LLM decomposes into sub-goals).

Fine-tuned LLaVA-7B perception with LoRA for landmark detection and rough distance/direction estimation.
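The Perceive, Reflect, Plan loop described above can be sketched as a small control loop. This is a minimal illustration on a toy grid, not the authors' implementation: the `Observation` and `Agent` structures, the `make_toy_agent` helper, and the grid world are all assumptions introduced here, and plain Python callables stand in for the vision model and the LLM planner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Node = Tuple[int, int]  # toy stand-in for a street-graph node


@dataclass
class Observation:
    """What the perception step reports (hypothetical structure)."""
    direction: str   # coarse direction toward the goal, e.g. "east"
    distance: float  # rough distance estimate in arbitrary units


@dataclass
class Agent:
    perceive: Callable[[Node], Observation]          # stands in for the vision model
    plan: Callable[[Observation, List[str]], str]    # stands in for the LLM planner
    memory: List[str] = field(default_factory=list)  # episodic log as plain text

    def step(self, node: Node) -> str:
        obs = self.perceive(node)                    # Perceive
        self.memory.append(                          # Reflect: record the experience
            f"at {node}: goal is {obs.direction}, about {obs.distance:.0f} away"
        )
        return self.plan(obs, self.memory)           # Plan: pick the next move


MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def run_episode(agent: Agent, start: Node, goal: Node, max_steps: int = 20) -> bool:
    """Run the loop until the goal is reached or the step budget runs out."""
    node = start
    for _ in range(max_steps):
        if node == goal:
            return True
        dx, dy = MOVES[agent.step(node)]
        node = (node[0] + dx, node[1] + dy)
    return node == goal


def make_toy_agent(goal: Node) -> Agent:
    """Build an agent with oracle-style perception and a trivial planner."""
    def perceive(node: Node) -> Observation:
        dx, dy = goal[0] - node[0], goal[1] - node[1]
        if abs(dx) >= abs(dy):
            direction = "east" if dx > 0 else "west"
        else:
            direction = "north" if dy > 0 else "south"
        return Observation(direction, abs(dx) + abs(dy))

    def plan(obs: Observation, memory: List[str]) -> str:
        return obs.direction  # a real planner would also consult memory

    return Agent(perceive, plan)
```

In a real system the `perceive` callable would wrap the fine-tuned vision model and `plan` would wrap an LLM call; the loop structure stays the same.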

Key Findings

PReP substantially improves success rate over reactive and other LLM prompting baselines.

Numbers: Average SR ≈ 54% across four city test sets

Practical Use: Use a perception+memory+planning workflow instead of stepwise reactive prompts to get much higher navigation success in city maps.

Evidence Ref: Abstract; Sec. 5; Table 1

Fine-tuned LLaVA perception is nearly as good as ground-truth perception for navigation.

Numbers: Fine-tuned LLaVA SR ≈ 4% lower than oracle SR on average

Practical Use: Spend modest compute to fine-tune the vision LLM with LoRA; perception quality then largely closes the gap to the GPS-based oracle.

Evidence Ref: Sec. 5.3; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success Rate (Beijing)	66% (PReP)	React 41%	+25% absolute	Beijing test set (100 tasks)	Table 1, Sec.5.2	Table 1
Success Rate (Shanghai)	51% (PReP)	React 25%	+26% absolute	Shanghai test set (100 tasks)	Table 1, Sec.5.2	Table 1

What To Try In 7 Days

Fine-tune a vision LLM (LLaVA) with LoRA on a few thousand landmark images for your area.

Log agent steps as short natural-language episodic memories and test simple retrieval + LLM summarization.

Prototype a planner prompt that decomposes target direction into 2–4 sub-goals and compare against reactive prompting.
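For the planner prototype in the last item, one possible starting point is a prompt builder plus a parser for the numbered sub-goals in the reply. This is a sketch under stated assumptions: the prompt wording, the function names `build_planner_prompt` and `parse_subgoals`, and the numbered-line reply format are all invented here for illustration and are not taken from the paper.

```python
import re
from typing import List


def build_planner_prompt(goal_description: str, direction: str,
                         distance_m: float, memory_notes: List[str]) -> str:
    """Assemble a planner prompt (hypothetical wording, not the paper's prompt)."""
    notes = "\n".join(f"- {n}" for n in memory_notes) or "- (none yet)"
    return (
        "You are navigating a city on foot.\n"
        f"Goal: {goal_description}\n"
        f"Estimated goal direction: {direction}; distance: about {distance_m:.0f} m\n"
        f"Relevant memories:\n{notes}\n"
        "Decompose the route into 2 to 4 numbered sub-goals, one per line."
    )


def parse_subgoals(reply: str, max_subgoals: int = 4) -> List[str]:
    """Extract lines shaped like '1. ...' or '2) ...' from the model reply."""
    found = re.findall(r"^\s*\d+[.)]\s*(.+?)\s*$", reply, flags=re.MULTILINE)
    return found[:max_subgoals]
```

The prompt string can be sent to whichever LLM is being evaluated; running the same tasks with a single-step reactive prompt gives the comparison baseline.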

Agent Features

Memory

episodic memory (step-by-step text logs)

semantic memory (summaries learned from episodes)

working memory with anticipate-evaluate retrieval
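The three memory types can be illustrated with a small container class. This is a simplified sketch, not the paper's module: the `Memory` class and its method names are hypothetical, the `summarizer` argument stands in for an LLM summarization call, and retrieval uses plain word overlap rather than the anticipate-evaluate scheme, whose details are not given here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase word set with commas and periods stripped."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


@dataclass
class Memory:
    episodic: List[str] = field(default_factory=list)  # one text entry per step
    semantic: List[str] = field(default_factory=list)  # summaries distilled from episodes

    def log_step(self, entry: str) -> None:
        self.episodic.append(entry)

    def summarize(self, summarizer: Callable[[List[str]], str]) -> None:
        """Condense the episodic log; `summarizer` stands in for an LLM call."""
        self.semantic.append(summarizer(self.episodic))

    def working_memory(self, query: str, k: int = 2) -> List[str]:
        """Return up to k stored entries sharing the most words with the query."""
        q = _tokens(query)
        pool = self.semantic + self.episodic
        scored = [(len(q & _tokens(e)), i, e) for i, e in enumerate(pool)]
        ranked = sorted(scored, key=lambda t: (-t[0], t[1]))  # best overlap first
        return [e for score, _, e in ranked[:k] if score > 0]
```

Word overlap is a crude relevance signal (common words such as "the" count too); an embedding-based retriever would be a natural replacement once the logging format is settled.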

Planning

long-term plan decomposition into sub-goals

short-term action selection from road connections
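Short-term action selection from road connections can be illustrated with a simple geometric rule: pick the outgoing road whose heading is closest to the direction of the current sub-goal. This is a hand-written heuristic for illustration only; in PReP the choice is made by the LLM, and the road ids, the heading convention (0 degrees = north, clockwise), and the function names below are assumptions.

```python
from typing import Dict

COMPASS = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}


def angular_gap(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def choose_road(connections: Dict[str, float], subgoal_direction: str) -> str:
    """Pick the outgoing road whose heading best matches the sub-goal direction.

    `connections` maps a hypothetical road id to its heading in degrees
    (0 = north, measured clockwise).
    """
    target = COMPASS[subgoal_direction]
    return min(connections, key=lambda road: angular_gap(connections[road], target))
```

A rule like this also makes a useful non-LLM baseline when evaluating how much the planner adds.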

Tool Use

LoRA

LLM API calls (GPT-4-turbo etc.)

Frameworks

prompt-based LLM orchestration

LLaMA-Factory (finetune pipeline)

Is Agentic

Yes

Architectures

multimodal LLM (LLaVA-7B)

LLM planner (GPT-4-turbo / LLaMA3-8B fine-tuned)

Optimization Features

Infra Optimization

Single NVIDIA A100 used for fine-tuning; inference done via GPU plus LLM API

Model Optimization

LoRA

Training Optimization

Small annotated dataset (≈8k images) to fine-tune perception quickly

Inference Optimization

None targeted; per-step latency ~12s reported

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/PReP-13B5

Risks & Boundaries

Limitations

Performance depends on powerful closed-source LLMs (GPT-4-turbo) for top results.

Test sets are limited (100 tasks per city), so reported SR may fluctuate on larger sets.
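To gauge how much a success rate measured on 100 tasks can fluctuate, a binomial confidence interval is a reasonable first check. The sketch below uses the Wilson score interval, which is a standard statistical tool and not something the paper reports; treating each task as an independent trial is an assumption. For the Beijing result (66 successes out of 100), the 95% interval is roughly 56% to 75%.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(66, 100)  # Beijing: 66% SR on 100 tasks
print(f"95% interval: {low:.3f} to {high:.3f}")
```

An interval this wide means that differences of a few points between methods on a single city should be read with caution.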

When Not To Use

When low-latency, real-time control is required (per-step ~12s).

In environments with few or ambiguous landmarks.

Failure Modes

Repeating loops or circling when memory retrieval or planning fails.

Getting stuck in dead ends when perception infers the goal direction but the road graph has no route in that direction.
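A lightweight guard against the looping failure mode is to watch the recent trajectory for repeated nodes and trigger a re-plan when one shows up too often. This is a generic heuristic sketched for illustration, not a mechanism described in the paper; the function name, window size, and repeat threshold are arbitrary assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def is_looping(visited: Sequence[Hashable], window: int = 8,
               max_repeats: int = 3) -> bool:
    """True if any node appears at least `max_repeats` times in the last `window` steps."""
    recent = visited[-window:]
    if not recent:
        return False
    return max(Counter(recent).values()) >= max_repeats
```

When the check fires, an agent could discard the current sub-goal list and ask the planner for a new route, optionally adding a note about the loop to memory.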

Core Entities

Models

LLaVA-7B

LLaMA3-8B

GPT-4-turbo

GPT-3.5-turbo

GLM-4

Mistral-7B

Metrics

Success Rate (SR)

Success weighted by Path Length (SPL)

Accuracy
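For reference, SR and SPL can be computed as below. The SPL formula is the definition commonly used in embodied-navigation evaluation (each success is weighted by shortest path length divided by the larger of the taken and shortest path lengths); whether the paper uses exactly this form is an assumption, and the example numbers are made up for illustration.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that reached the goal."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool], shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        if s:  # failed episodes contribute zero
            total += l / max(p, l)
    return total / len(successes)


# Illustrative episodes: (success, shortest path, path actually taken)
ok = [True, True, False, True]
shortest_len = [10.0, 10.0, 10.0, 5.0]
taken_len = [10.0, 20.0, 30.0, 5.0]
print(success_rate(ok), spl(ok, shortest_len, taken_len))
```

SPL is never higher than SR, and the gap between them indicates how much detouring the successful episodes involved.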

Datasets

Beijing_CBD_streetviews

Shanghai_CBD_streetviews

NewYork_CBD_streetviews

Paris_CBD_streetviews

Benchmarks

PReP four-city navigation test set (100 tasks per city)
