Use LLMs to patch rule-based driving planners and cut dangerous scenarios on nuPlan.

Overview

Decision Snapshot: Needs Validation

LLM-ASSIST shows that constrained use of LLMs (parameter outputs) improves closed-loop planning on nuPlan; still, latency, grounding, and hallucination risks must be fixed before real-world deployment.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 55%

Authors

S P Sharan, Francesco Pittaluga, Vijay Kumar B G, Manmohan Chandraker

Why It Matters For Business

A language model can be used to patch edge-case failures of a strong rule-based planner and reduce dangerous scenarios without retraining the core planner, but latency, cost, and hallucination risk must be managed.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper builds a hybrid planner that keeps a strong rule-based planner (PDM-Closed) for routine driving and invokes an LLM only when the base planner's simulated proposals score poorly. Two LLM roles are tested: (1) an unconstrained LLM that outputs full trajectories, and (2) a parameterized LLM that returns planner parameters for PDM-Closed. The parameterized approach (GPT-3-ASSISTPAR) gives state-of-the-art closed-loop results on the nuPlan val14 split and reduces dangerous driving scenarios by ~11% versus PDM-Closed. Limitations: text-only state input, LLM latency, and hallucination risk.

Problem Statement

Rule-based planners handle most traffic but fail in some complex or rare scenarios. Pure learning planners overfit or struggle in closed-loop settings. Can large language models’ commonsense reasoning be used to fix those hard cases without losing the safe behavior of rule-based planners?

Main Contribution

A score-based gating strategy: invoke an LLM only when the base planner's simulated proposal scores fall below thresholds.

Two LLM integrations: unconstrained LLM that outputs trajectories and a parameterized LLM that returns planner parameters for PDM-Closed.
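The score-based gate can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Proposal` type, the `plan_step` function, the 0.9 threshold, and the rule that the LLM result is kept only if it scores higher are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Proposal:
    """A candidate trajectory with its simulated score (hypothetical type)."""
    name: str
    score: float  # assumed to lie in [0, 1], higher is better


def plan_step(
    proposals: Sequence[Proposal],
    llm_planner: Callable[[], Proposal],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Proposal:
    """Return the base planner's best proposal unless its score is too low."""
    best = max(proposals, key=lambda p: p.score)
    if best.score >= threshold:
        return best  # routine case: the LLM is never called
    candidate = llm_planner()  # hard case: ask the LLM-backed planner
    # Assumption: keep whichever proposal scores higher, so the LLM
    # path cannot return something worse than the base planner.
    return candidate if candidate.score > best.score else best
```

Because the LLM is consulted only below the threshold, its cost and latency are paid on the hard cases rather than on every planning step.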

Key Findings

Parameterizing the base planner with an LLM reduces dangerous driving scenarios.

Numbers: 11% fewer dangerous events vs PDM-Closed (nuPlan val14)

Practical Use: Use an LLM to pick planner parameters rather than replacing the planner to get immediate safety gains on evaluated benchmarks.

Evidence Ref: Section 5.3, Table 2

GPT-3-ASSISTPAR yields higher overall closed-loop scores than PDM-Closed on val14.

Numbers: Non-reactive score 93.05 vs 92.51 (PDM-Closed)

Practical Use: Small but consistent score gains suggest LLM parameterization can improve closed-loop benchmark metrics without retraining the planner.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Closed-loop non-reactive combined score	93.05 (GPT-3-ASSISTPAR)	92.51 (PDM-Closed)	+0.54	nuPlan val14	Table 2 reports scores for PDM-Closed and GPT-3-ASSISTPAR on val14	Table 2
Closed-loop reactive combined score	92.20 (GPT-3-ASSISTPAR)	91.79 (PDM-Closed)	+0.41	nuPlan val14	Table 2 reports scores for PDM-Closed and GPT-3-ASSISTPAR on val14	Table 2

What To Try In 7 Days

Add a simulation-based score threshold to detect low-confidence planner outputs and gate LLM invocation.

Implement a parameterized LLM interface that returns planner hyperparameters, not raw trajectories.

Run an offline eval on a held-out set (nuPlan val14 or your scenarios) permitting up to 4 LLM queries per decision step and measure safety metrics and latency.
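For the offline eval, a per-step query budget and a latency measurement might look like the sketch below. The `query_with_budget` helper and the `ask_llm` callable are hypothetical names; the budget of 4 comes from the step above, and a reply of `None` is assumed to mean "unusable, try again".

```python
import time
from typing import Callable, Optional


def query_with_budget(
    ask_llm: Callable[[int], Optional[dict]],
    max_queries: int = 4,
) -> tuple[Optional[dict], int, float]:
    """Call ask_llm until it returns a usable result or the budget runs out.

    Returns (result, queries_used, elapsed_seconds).
    """
    start = time.perf_counter()
    result = None
    used = 0
    for attempt in range(max_queries):
        used = attempt + 1
        result = ask_llm(attempt)  # attempt index lets the caller vary the prompt
        if result is not None:
            break  # stop early on the first usable reply
    return result, used, time.perf_counter() - start
```

Logging `queries_used` and `elapsed_seconds` per decision step gives the latency distribution to report alongside the safety metrics.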

Agent Features

Planning

gated LLM invocation (score-based), parameterized planning (LLM outputs planner params), unconstrained planning (LLM outputs trajectories)

Tool Use

GPT-3, GPT-4, Llama2-7B, PDM-Closed

Frameworks

nuPlan API

Is Agentic

Yes

Architectures

hybrid rule-based + LLM

Collaboration

LLM supplements base planner decisions

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://llmassist.github.io

Data URLs

https://github.com/motional/nuplan-devkit

Risks & Boundaries

Limitations

System uses a text-only parsed state, omitting raw sensor richness.

LLMs introduce latency; Llama2-7B took ~3s for a single parameter query in tests.

When Not To Use

In hard real-time control loops that require millisecond-level latency.

As a standalone replacement for the planner (LLM direct trajectories performed poorly).

Failure Modes

LLM hallucination produces incorrect planner parameters or malformed outputs.

LLM formatting errors break the planner interface and cause fallback to low-quality proposals.
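One way to contain both failure modes is to validate the LLM reply strictly and fall back to known-good defaults when it is malformed or out of range. The sketch below assumes the LLM is asked for a JSON object; the parameter names, bounds, and defaults are hypothetical and are not the paper's schema.

```python
import json
from typing import Optional

# Hypothetical parameter names and bounds; not the paper's actual schema.
BOUNDS = {
    "speed_limit_fraction": (0.2, 1.0),
    "min_gap_m": (0.5, 5.0),
    "headway_time_s": (0.5, 3.0),
}
DEFAULTS = {"speed_limit_fraction": 1.0, "min_gap_m": 1.0, "headway_time_s": 1.5}


def parse_planner_params(raw: str) -> Optional[dict]:
    """Return validated parameters, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output, e.g. prose instead of JSON
    if not isinstance(data, dict):
        return None
    params = {}
    for key, (low, high) in BOUNDS.items():
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not low <= value <= high:
            return None  # out-of-range (or NaN) values are treated as hallucinated
        params[key] = float(value)
    return params


def params_or_default(raw: str) -> dict:
    """Use the LLM's parameters when valid, otherwise the planner defaults."""
    parsed = parse_planner_params(raw)
    return parsed if parsed is not None else dict(DEFAULTS)
```

Rejecting the whole reply on any invalid field is a conservative choice: a partially valid parameter set could combine values that were never meant to go together.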

Core Entities

Models

PDM-Closed, IDM, GPT-3, GPT-4, Llama2-7B

Metrics

Score, Collisions, Time-to-Collision (TTC), Drivable, Comfort, Progress, Speed Limit, Direction

Datasets

nuPlan val14

Benchmarks

nuPlan closed-loop non-reactive, nuPlan closed-loop reactive
