Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
A language model can be used to patch edge-case failures of a strong rule-based planner and reduce dangerous scenarios without retraining the core planner, but latency, cost, and hallucination risk must be managed.
Summary TLDR
The paper builds a hybrid planner that keeps a strong rule-based planner (PDM-Closed) for routine driving and invokes an LLM only when the base planner's proposals score poorly in simulation. Two LLM roles are tested: (1) an unconstrained LLM that outputs full trajectories and (2) a parameterized LLM that returns planner parameters for PDM-Closed. The parameterized approach (GPT-3-ASSISTPAR) gives state-of-the-art closed-loop results on the nuPlan val14 split and reduces dangerous driving scenarios by ~11% versus PDM-Closed. Limitations: text-only state input, LLM latency, and hallucination risk.
Problem Statement
Rule-based planners handle most traffic but fail in some complex or rare scenarios. Purely learned planners overfit or struggle in closed-loop settings. Can the commonsense reasoning of large language models be used to fix those hard cases without losing the safe behavior of rule-based planners?
Main Contribution
A score-based gating strategy: invoke an LLM only when the base planner's simulated proposal scores fall below thresholds.
Two LLM integrations: an unconstrained LLM that outputs trajectories and a parameterized LLM that returns planner parameters for PDM-Closed.
Demonstration that the parameterized LLM (GPT-3-ASSISTPAR) achieves SOTA on nuPlan val14 closed-loop reactive/non-reactive benchmarks and reduces dangerous events by ~11%.
Ablations showing benefits of multiple LLM queries, emergency-brake control, and temperature tuning.
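The score-based gating idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `Proposal` type, the single scalar threshold, and the rule "gate on the best proposal's score" are assumptions made here for illustration, and `llm_planner` stands in for whatever LLM-backed planner is plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Proposal:
    """Hypothetical candidate trajectory with its simulated score (higher is better)."""
    name: str
    score: float


def should_invoke_llm(proposals: Sequence[Proposal], threshold: float) -> bool:
    """Assumed gate: call the LLM only if no proposal reaches the threshold."""
    if not proposals:
        return True
    return max(p.score for p in proposals) < threshold


def plan_step(
    proposals: Sequence[Proposal],
    threshold: float,
    llm_planner: Callable[[], Proposal],
) -> Optional[Proposal]:
    """Return the base planner's best proposal unless the gate fires
    and the LLM-assisted proposal scores higher."""
    best = max(proposals, key=lambda p: p.score) if proposals else None
    if should_invoke_llm(proposals, threshold):
        candidate = llm_planner()
        if best is None or candidate.score > best.score:
            return candidate
    return best
```

With this shape the LLM is never called on routine steps, so its latency and cost are only paid in the low-scoring cases the gate selects.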
Key Findings
Parameterizing the base planner with an LLM reduces dangerous driving scenarios.
GPT-3-ASSISTPAR yields higher overall closed-loop scores than PDM-Closed on val14.
Direct trajectory outputs from a plain LLM perform poorly compared to the hybrid approach.
Allowing multiple LLM queries per time step improves planner performance.
Giving the LLM emergency-brake control slightly improves safety metrics.
Results
Closed-loop non-reactive combined score
Closed-loop reactive combined score
Dangerous driving events
Pure LLM (direct trajectories) performance on subset
Who Should Care
What To Try In 7 Days
Add a simulation-based score threshold to detect low-confidence planner outputs and gate LLM invocation.
Implement a parameterized LLM interface that returns planner hyperparameters, not raw trajectories.
Run an offline eval on a held-out set (nuPlan val14 or your scenarios) permitting up to 4 LLM queries per decision step and measure safety metrics and latency.
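For the offline evaluation step, a bounded query loop with latency measurement might look like the sketch below. The function name, the callbacks `ask_llm` and `score_params`, and the early-stop rule are all hypothetical; only the budget of up to 4 queries per decision step comes from the text above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Params = Dict[str, float]


def query_with_budget(
    ask_llm: Callable[[], Optional[Params]],
    score_params: Callable[[Params], float],
    base_score: float,
    threshold: float,
    max_queries: int = 4,
) -> Tuple[Optional[Params], float, int, float]:
    """Query the LLM up to max_queries times and keep the best-scoring parameters.

    Returns (best_params or None, best_score, queries_used, elapsed_seconds).
    best_params is None when no LLM answer beats the base planner's score.
    """
    best_params: Optional[Params] = None
    best_score = base_score
    used = 0
    start = time.perf_counter()
    for _ in range(max_queries):
        used += 1
        params = ask_llm()  # assumed to return None on a malformed answer
        if params is None:
            continue
        score = score_params(params)  # e.g. re-simulate the planner with these params
        if score > best_score:
            best_params, best_score = params, score
        if best_score >= threshold:
            break  # good enough, stop spending queries
    elapsed = time.perf_counter() - start
    return best_params, best_score, used, elapsed
```

Logging `queries_used` and `elapsed_seconds` per decision step gives the latency and cost figures to report alongside the safety metrics.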
Agent Features
Planning
- gated LLM invocation (score-based)
- parameterized planning (LLM outputs planner params)
- unconstrained planning (LLM outputs trajectories)
Tool Use
- GPT-3
- GPT-4
- Llama2-7B
- PDM-Closed
Frameworks
- nuPlan API
Is Agentic
true
Architectures
- hybrid rule-based + LLM
Collaboration
- LLM supplements base planner decisions
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- System uses a text-only parsed state, omitting raw sensor richness.
- LLMs introduce latency; Llama2-7B took ~3s for a single parameter query in tests.
- LLMs can hallucinate or misformat outputs; this is risky in safety-critical control.
- Results are reported on nuPlan val14; real-world generalization remains unproven.
When Not To Use
- In hard real-time control loops that require millisecond-level latency.
- As a standalone replacement for the planner (LLM direct trajectories performed poorly).
- In safety-critical deployment without certified checks and redundancy.
Failure Modes
- LLM hallucination produces incorrect planner parameters or malformed outputs.
- LLM formatting errors break the planner interface and cause fallback to low-quality proposals.
- Increased compute and API costs from multiple LLM queries per time step.
- Dependency on commercial LLMs that may change behavior or access.
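One way to contain the hallucination and formatting failure modes above is to validate every LLM answer before it reaches the planner. The sketch below assumes the LLM is prompted to reply with a JSON object; the parameter names and bounds in `BOUNDS` are hypothetical and are not PDM-Closed's actual parameters.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical parameter names and safe ranges, for illustration only.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "target_speed_mps": (0.0, 20.0),
    "min_gap_m": (0.5, 10.0),
}


def parse_planner_params(raw: str) -> Optional[Dict[str, float]]:
    """Parse an LLM reply into planner parameters.

    Returns None on any malformed, incomplete, or out-of-range reply,
    so the caller can keep the base planner's default parameters.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require exactly the expected keys: no missing or invented parameters.
    if not isinstance(data, dict) or set(data) != set(BOUNDS):
        return None
    params: Dict[str, float] = {}
    for key, (low, high) in BOUNDS.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not low <= value <= high:  # also rejects NaN
            return None
        params[key] = float(value)
    return params
```

Rejecting the whole reply on any violation, rather than clamping values, keeps the fallback path simple: the planner either receives fully validated parameters or runs unchanged.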
Core Entities
Models
- PDM-Closed
- IDM
- GPT-3
- GPT-4
- Llama2-7B
Metrics
- Score
- Collisions
- Time-to-Collision (TTC)
- Drivable
- Comfort
- Progress
- Speed Limit
- Direction
Datasets
- nuPlan val14
Benchmarks
- nuPlan closed-loop non-reactive
- nuPlan closed-loop reactive

