Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
PReP shows that autonomous navigation agents can operate without explicit step-by-step instructions and with far less training data than reinforcement-learning (RL) approaches, which speeds up prototyping of navigation assistants, accessibility tools, and search-and-rescue systems.
Summary TLDR
This paper builds PReP, an LLM-driven agentic workflow for finding goal locations in city street graphs when the goal is only described relative to landmarks. Components: a fine-tuned LLaVA vision model for landmark direction/distance, a memory module (episodic + semantic + working memory) to form a cognitive map, and an LLM planner that decomposes routes into sub-goals. On four central business district (CBD) datasets (Beijing, Shanghai, New York, Paris) PReP reaches ~54% average success rate, improving substantially over reactive LLM prompting and several other baselines. The method is more data-efficient than RL but pays for it in LLM API usage and per-step latency (~12s), and it performs best when landmarks are visible.
Problem Statement
Given only street-view images and a textual goal described relative to known landmarks (no step-by-step instructions or map), build an agent that self-localizes, infers goal direction and distance, forms an internal map from experience, and plans multi-step routes to reach the goal in complex city road graphs.
Main Contribution
PReP workflow: Perceive (vision LLM), Reflect (episodic + semantic + working memory), Plan (LLM decomposes into sub-goals).
Fine-tuned LLaVA-7B perception with LoRA for landmark detection and rough distance/direction estimation.
Natural-language long-term memory with retrieval, plus an anticipate-evaluate working memory, for robust inference when landmarks are not visible.
Evaluation on four real-CBD city datasets and ablations showing large gains over reactive prompting and several LLM-based baselines.
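The Perceive, Reflect, Plan loop listed above can be sketched on a toy road graph. This is a minimal illustration, not the paper's implementation: the graph, the coordinates, and the `Agent` methods are hypothetical, perception is stubbed with ground-truth geometry where PReP would call the fine-tuned LLaVA model, and planning is a simple rule where PReP would call an LLM.

```python
from dataclasses import dataclass, field

# Hypothetical toy road graph: node -> {heading: neighbour}.
ROADS = {
    "A": {"east": "B"},
    "B": {"west": "A", "east": "C", "north": "D"},
    "C": {"west": "B"},
    "D": {"south": "B", "north": "E"},
    "E": {"south": "D"},
}
# Coordinates are used only by the stubbed perception step.
COORDS = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (1, 1), "E": (1, 2)}


@dataclass
class Agent:
    memory: list = field(default_factory=list)

    def perceive(self, node, goal):
        """Stub for the vision model: return a coarse direction to the goal."""
        dx = COORDS[goal][0] - COORDS[node][0]
        dy = COORDS[goal][1] - COORDS[node][1]
        if dy != 0 and abs(dy) >= abs(dx):
            return "north" if dy > 0 else "south"
        if dx != 0:
            return "east" if dx > 0 else "west"
        return "here"

    def reflect(self, node, direction):
        """Store the observation as a short natural-language memory."""
        self.memory.append(f"At {node}, the goal seems to be to the {direction}.")

    def plan(self, node, direction):
        """Stub for the LLM planner: follow the goal direction if a road allows it."""
        options = ROADS[node]
        if direction in options:
            return direction
        return sorted(options)[0]  # fallback: any available road


def navigate(start, goal, max_steps=10):
    agent, node, path = Agent(), start, [start]
    for _ in range(max_steps):
        if node == goal:
            break
        direction = agent.perceive(node, goal)
        agent.reflect(node, direction)
        heading = agent.plan(node, direction)
        node = ROADS[node][heading]
        path.append(node)
    return path, agent.memory


path, memory = navigate("A", "E")
print(path)
print(memory)
```

Each stub marks the point where a model call would go, so the control flow can be tested before any model is attached.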
Key Findings
PReP substantially improves success rate over reactive and other LLM prompting baselines.
Fine-tuned LLaVA perception is nearly as good as ground-truth perception for navigation.
Ablations show both reflection (memory) and planning matter; removing them drops performance.
Fine-tuning massively improves landmark detection metrics.
Results
Success Rate (Beijing)
Success Rate (Shanghai)
Perception gap vs oracle (avg SR)
Accuracy
Who Should Care
What To Try In 7 Days
Fine-tune a vision LLM (LLaVA) with LoRA on a few thousand landmark images for your area.
Log agent steps as short natural-language episodic memories and test simple retrieval + LLM summarization.
Prototype a planner prompt that decomposes target direction into 2–4 sub-goals and compare against reactive prompting.
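For the planner-versus-reactive comparison, one option is to build both prompts from the same inputs and send each to the same LLM. The sketch below only constructs the prompt strings. The function names and the prompt wording are hypothetical and are not the prompts used in the paper.

```python
def reactive_prompt(observation: str, options: list[str]) -> str:
    """Baseline: ask for the next move from the current observation only."""
    return (
        f"Observation: {observation}\n"
        f"Available roads: {', '.join(options)}\n"
        "Pick the single best road to take next. Answer with one option."
    )


def planner_prompt(
    observation: str,
    goal_estimate: str,
    memory_summary: str,
    options: list[str],
    max_subgoals: int = 4,
) -> str:
    """Planning variant: ask for a short list of sub-goals before the next move."""
    if not 2 <= max_subgoals <= 4:
        raise ValueError("max_subgoals must be between 2 and 4")
    return (
        f"Observation: {observation}\n"
        f"Estimated goal direction and distance: {goal_estimate}\n"
        f"What you remember so far: {memory_summary}\n"
        f"Available roads: {', '.join(options)}\n"
        f"First, break the route into at most {max_subgoals} sub-goals.\n"
        "Then pick the road that best serves the first sub-goal."
    )


print(reactive_prompt("A tall tower is visible ahead.", ["north", "east"]))
print(planner_prompt(
    "A tall tower is visible ahead.",
    "northeast, roughly 500 m",
    "No useful memories yet.",
    ["north", "east"],
    max_subgoals=3,
))
```

Keeping the two builders side by side makes it easy to hold the observation and road options fixed while varying only the prompting strategy.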
Agent Features
Memory
- episodic memory (step-by-step text logs)
- semantic memory (summaries learned from episodes)
- working memory with anticipate-evaluate retrieval
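The three memory types above can be prototyped with plain text entries. This sketch is an assumption-heavy simplification: the `Memory` class is hypothetical, semantic summaries are added by hand where PReP would have an LLM write them, and the anticipate-evaluate step is approximated by forming a query and keeping only entries whose word overlap passes a threshold.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Memory:
    def __init__(self) -> None:
        self.episodic: list[str] = []  # step-by-step text logs
        self.semantic: list[str] = []  # summaries distilled from episodes

    def log_step(self, step: int, note: str) -> None:
        self.episodic.append(f"step {step}: {note}")

    def add_summary(self, summary: str) -> None:
        # Assumption: in a real system an LLM would write this summary.
        self.semantic.append(summary)

    def retrieve(self, query: str, k: int = 2, min_overlap: int = 1) -> list[str]:
        """Return up to k entries sharing the most words with the query.

        Semantic summaries are listed before episodic logs, so they win ties.
        """
        q = tokens(query)
        scored = []
        for i, entry in enumerate(self.semantic + self.episodic):
            overlap = len(q & tokens(entry))
            if overlap >= min_overlap:
                scored.append((overlap, -i, entry))
        scored.sort(reverse=True)
        return [entry for _, _, entry in scored[:k]]


mem = Memory()
mem.log_step(1, "saw the TV tower to the northeast, far away")
mem.log_step(2, "turned east, TV tower no longer visible")
mem.log_step(3, "passed a bank on the south side")
mem.add_summary("the TV tower lies northeast of the starting intersection")
print(mem.retrieve("where is the TV tower"))
```

Word overlap is only a placeholder for relevance scoring; embedding similarity or an LLM judgment could replace it without changing the interface.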
Planning
- long-term plan decomposition into sub-goals
- short-term action selection from road connections
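The two planning levels can be illustrated with simple geometry. The sketch assumes the goal estimate is available as an east/north offset in metres and that each road connection has a compass heading in degrees. Both are assumptions for illustration, and all names are hypothetical. PReP delegates these decisions to an LLM, so the rules here are a swapped-in baseline.

```python
def decompose(east_m: float, north_m: float) -> list[tuple[str, float]]:
    """Long-term plan: split an offset to the goal into axis-aligned legs."""
    legs = []
    if east_m:
        legs.append(("east" if east_m > 0 else "west", abs(east_m)))
    if north_m:
        legs.append(("north" if north_m > 0 else "south", abs(north_m)))
    return legs


BEARING = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}


def angular_gap(a: float, b: float) -> float:
    """Smallest absolute angle between two compass bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def choose_road(road_headings: dict[str, float], target_bearing: float) -> str:
    """Short-term action: pick the connected road closest to the target bearing."""
    return min(
        road_headings,
        key=lambda name: angular_gap(road_headings[name], target_bearing),
    )


legs = decompose(east_m=300, north_m=-400)
roads_here = {"north_st": 0.0, "east_ave": 90.0, "south_st": 180.0}
first_direction = legs[0][0]
print(legs, choose_road(roads_here, BEARING[first_direction]))
```

The split mirrors the list above: `decompose` produces sub-goals once, while `choose_road` runs at every intersection against the current sub-goal.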
Tool Use
- LoRA
- LLM API calls (GPT-4-turbo etc.)
Frameworks
- prompt-based LLM orchestration
- LLaMA-Factory (fine-tuning pipeline)
Is Agentic
true
Architectures
- multimodal LLM (LLaVA-7B)
- LLM planner (GPT-4-turbo or fine-tuned LLaMA3-8B)
Optimization Features
Infra Optimization
- Single NVIDIA A100 used for fine-tuning; inference done via GPU plus LLM API
Model Optimization
- LoRA
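To see why LoRA keeps fine-tuning cheap: for a frozen weight matrix of shape d_out x d_in, LoRA trains two low-rank factors of shapes d_out x r and r x d_in, so it adds r * (d_in + d_out) trainable parameters. The numbers below are illustrative assumptions (a 4096 x 4096 projection and rank 8), not settings reported for PReP.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one adapted weight matrix."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


# Illustrative values only: one 4096 x 4096 projection adapted at rank 8.
d_in, d_out, rank = 4096, 4096, 8
added = lora_params(d_in, d_out, rank)
ratio = added / full_params(d_in, d_out)
print(f"{added} trainable parameters, {ratio:.4%} of the dense matrix")
```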
Training Optimization
- Small annotated dataset (≈8k images) to fine-tune perception quickly
Inference Optimization
- None targeted; per-step latency ~12s reported
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on powerful closed-source LLMs (GPT-4-turbo) for top results.
- Test sets are limited (100 tasks per city), so reported SR may fluctuate on larger sets.
- Method relies on visible, identifiable landmarks; performance degrades when landmarks are scarce or invisible.
- Per-step latency (~12s) and API use may be too slow/expensive for real-time robotic deployment.
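The test-set-size limitation can be quantified with a normal-approximation confidence interval for a proportion. The sketch uses 0.54 as an illustrative per-city success rate, which is an assumption since ~54% is the reported four-city average, together with the 100 tasks per city.

```python
import math


def sr_interval(success_rate: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a success rate measured on n_tasks episodes."""
    se = math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)
    low = max(0.0, success_rate - z * se)
    high = min(1.0, success_rate + z * se)
    return low, high


low, high = sr_interval(0.54, 100)
print(f"SR 54% on 100 tasks: roughly {low:.1%} to {high:.1%}")
```

With 100 tasks the interval spans about ten percentage points on either side, so small differences between methods on a single city should be read with caution.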
When Not To Use
- When low-latency, real-time control is required (per-step ~12s).
- In environments with few or ambiguous landmarks.
- When strict safety certification is required for autonomous vehicles.
Failure Modes
- Repeating loops or circling when memory retrieval or planning fails.
- Getting stuck in dead ends when perception infers a goal direction but the road graph has no route in that direction.
- Perception misdetections (false positives) lead to wrong goal inferences.
- Dependence on closed-source LLM behavior and prompt sensitivity.
Core Entities
Models
- LLaVA-7B
- LLaMA3-8B
- GPT-4-turbo
- GPT-3.5-turbo
- GLM-4
- Mistral-7B
Metrics
- Success Rate (SR)
- Success weighted by Path Length (SPL)
- Accuracy
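SR and SPL can be computed from per-episode logs as follows. The sketch uses the commonly cited SPL definition, in which each successful episode contributes shortest_length / max(path_length, shortest_length) and failures contribute zero. That the paper uses exactly this variant is an assumption, and the episode tuples are made-up examples.

```python
def success_rate(episodes: list[tuple[bool, float, float]]) -> float:
    """Fraction of episodes that reached the goal."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: list[tuple[bool, float, float]]) -> float:
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_length, actual_path_length).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Made-up episodes: an optimal success, a success via a detour, and a failure.
episodes = [(True, 100.0, 100.0), (True, 100.0, 200.0), (False, 100.0, 50.0)]
print(success_rate(episodes), spl(episodes))
```

SPL is never higher than SR, and the gap between them shows how much longer successful routes were than the shortest route.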
Datasets
- Beijing_CBD_streetviews
- Shanghai_CBD_streetviews
- NewYork_CBD_streetviews
- Paris_CBD_streetviews
Benchmarks
- PReP four-city navigation testset (100 tasks per city)

