Use LLMs to patch rule-based driving planners and cut dangerous scenarios on nuPlan.

December 30, 2023 | 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.4

Citation Count

6

Authors

S P Sharan, Francesco Pittaluga, Vijay Kumar B G, Manmohan Chandraker

Why It Matters For Business

A language model can be used to patch edge-case failures of a strong rule-based planner and reduce dangerous scenarios without retraining the core planner, but latency, cost, and hallucination risk must be managed.

Summary TLDR

The paper builds a hybrid planner that keeps a strong rule-based planner (PDM-Closed) for routine driving and invokes an LLM only when the base planner's proposals score poorly. Two LLM roles are tested: (1) an unconstrained LLM that outputs full trajectories, and (2) a parameterized LLM that returns planner parameters for PDM-Closed. The parameterized approach (GPT-3-ASSISTPAR) gives state-of-the-art closed-loop results on the nuPlan val14 split and reduces dangerous driving scenarios by ~11% versus PDM-Closed. Limitations: text-only state input, LLM latency, and hallucination risk.

Problem Statement

Rule-based planners handle most traffic but fail in some complex or rare scenarios. Pure learning planners overfit or struggle in closed-loop settings. Can large language models’ commonsense reasoning be used to fix those hard cases without losing the safe behavior of rule-based planners?

Main Contribution

A score-based gating strategy: invoke an LLM only when the base planner's simulated proposal scores fall below thresholds.

Two LLM integrations: unconstrained LLM that outputs trajectories and a parameterized LLM that returns planner parameters for PDM-Closed.

Demonstration that the parameterized LLM (GPT-3-ASSISTPAR) achieves SOTA on nuPlan val14 closed-loop reactive/non-reactive benchmarks and reduces dangerous events by ~11%.

Ablations showing benefits of multiple LLM queries, emergency-brake control, and temperature tuning.
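
The score-based gate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `GateConfig` class, the `should_invoke_llm` function, and the threshold value of 0.9 are all assumptions made for the example, since this summary does not state the thresholds used.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical threshold; the source does not give the actual values.
    min_best_score: float = 0.9


def should_invoke_llm(
    proposal_scores: Sequence[float], cfg: GateConfig = GateConfig()
) -> bool:
    """Return True when no simulated proposal reaches the threshold."""
    if not proposal_scores:
        # The base planner produced nothing usable, so escalate.
        return True
    return max(proposal_scores) < cfg.min_best_score


print(should_invoke_llm([0.95, 0.40, 0.10]))  # one strong proposal exists
print(should_invoke_llm([0.55, 0.40, 0.10]))  # every proposal is weak
```

Gating on the best proposal keeps the LLM out of the loop whenever the rule-based planner already has at least one acceptable option, which is how the hybrid design limits latency and cost to the hard cases.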

Key Findings

Parameterizing the base planner with an LLM reduces dangerous driving scenarios.

Numbers: 11% fewer dangerous events vs PDM-Closed (nuPlan val14)

GPT-3-ASSISTPAR yields higher overall closed-loop scores than PDM-Closed on val14.

Numbers: Non-reactive score 93.05 vs 92.51 (PDM-Closed)

Direct trajectory outputs from a plain LLM perform poorly compared to the hybrid approach.

Numbers: GPT-3 planner score 18.08 vs GPT-3-ASSISTPAR 94.8 on a 140-sample subset

Allowing multiple LLM queries per time step improves planner performance.

Numbers: Non-reactive score rises from 92.51 (0 queries) to 93.05 (4 queries)

Giving the LLM emergency-brake control slightly improves safety metrics.

Numbers: Reactive score 92.16 with brake vs 91.85 without

Results

Closed-loop non-reactive combined score

Value: 93.05 (GPT-3-ASSISTPAR)

Baseline: 92.51 (PDM-Closed)

Closed-loop reactive combined score

Value: 92.20 (GPT-3-ASSISTPAR)

Baseline: 91.79 (PDM-Closed)

Dangerous driving events

Value: 11% reduction (GPT-3-ASSISTPAR vs PDM-Closed)

Baseline: PDM-Closed

Pure LLM (direct trajectories) performance on subset

Value: 18.08 (GPT-3)

Baseline: 94.8 (GPT-3-ASSISTPAR on same subset)

Who Should Care

What To Try In 7 Days

Add a simulation-based score threshold to detect low-confidence planner outputs and gate LLM invocation.

Implement a parameterized LLM interface that returns planner hyperparameters, not raw trajectories.

Run an offline eval on a held-out set (nuPlan val14 or your scenarios) permitting up to 4 LLM queries per decision step and measure safety metrics and latency.
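
One way to prototype the steps above is a bounded retry loop: re-simulate each LLM-suggested parameter set and keep the best one, stopping after at most four queries. The sketch below is an assumption-heavy toy. `plan_with_llm_assist`, the `target_speed` parameter, and both stub functions are hypothetical stand-ins for a real simulator and a real LLM call, and nothing here uses the nuPlan API.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def plan_with_llm_assist(
    base_params: Params,
    simulate: Callable[[Params], float],
    ask_llm: Callable[[Params, float], Params],
    threshold: float,
    max_queries: int = 4,
) -> Tuple[Params, float, int]:
    """Return (best params, best score, number of LLM queries used)."""
    best_params, best_score = base_params, simulate(base_params)
    queries = 0
    # Query only while the best known score is below the gate threshold.
    while best_score < threshold and queries < max_queries:
        candidate = ask_llm(best_params, best_score)
        queries += 1
        score = simulate(candidate)
        if score > best_score:  # keep a candidate only if it scores higher
            best_params, best_score = candidate, score
    return best_params, best_score, queries


# Toy stand-ins (assumptions): the score peaks at a target speed of 8.0,
# and the fake "LLM" always suggests slowing down by 2.0.
def toy_simulate(params: Params) -> float:
    return 1.0 - abs(params["target_speed"] - 8.0) / 10.0


def toy_llm(params: Params, score: float) -> Params:
    return {"target_speed": params["target_speed"] - 2.0}


result = plan_with_llm_assist({"target_speed": 14.0}, toy_simulate, toy_llm, 0.9)
print(result)
```

Logging the returned query count alongside wall-clock time per decision step gives the latency and cost figures the offline evaluation is meant to surface.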

Agent Features

Planning

  • gated LLM invocation (score-based)
  • parameterized planning (LLM outputs planner params)
  • unconstrained planning (LLM outputs trajectories)

Tool Use

  • GPT-3
  • GPT-4
  • Llama2-7B
  • PDM-Closed

Frameworks

  • nuPlan API

Is Agentic

true

Architectures

  • hybrid rule-based + LLM

Collaboration

  • LLM supplements base planner decisions

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • System uses a text-only parsed state, omitting raw sensor richness.
  • LLMs introduce latency; Llama2-7B took ~3s for a single parameter query in tests.
  • LLMs can hallucinate or misformat outputs; this is risky in safety-critical control.
  • Results are reported on nuPlan val14; real-world generalization remains unproven.
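
Given the latency limitation above, a deployment would likely need a hard time budget on each LLM call. The following is a minimal sketch using only the Python standard library; `params_within_budget`, the parameter names, and the budget values are illustrative assumptions, not something the paper describes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict

Params = Dict[str, float]


def params_within_budget(
    query_llm: Callable[[], Params], fallback: Params, budget_s: float
) -> Params:
    """Return the LLM's parameters, or the fallback if the call is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(query_llm)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        # The worker thread keeps running; its result is simply ignored.
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the planner on a slow call


def slow_llm() -> Params:
    time.sleep(0.3)  # stands in for a slow model call
    return {"target_speed": 5.0}


defaults = {"target_speed": 10.0}
print(params_within_budget(slow_llm, defaults, budget_s=0.05))
print(params_within_budget(lambda: {"target_speed": 5.0}, defaults, budget_s=1.0))
```

Falling back to the base planner's own parameters means a slow response degrades to plain rule-based behavior rather than stalling the control loop.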

When Not To Use

  • In hard real-time control loops that require millisecond-level latency.
  • As a standalone replacement for the planner (LLM direct trajectories performed poorly).
  • In safety-critical deployment without certified checks and redundancy.

Failure Modes

  • LLM hallucination produces incorrect planner parameters or malformed outputs.
  • LLM formatting errors break the planner interface and cause fallback to low-quality proposals.
  • Increased compute and API costs from multiple LLM queries per time step.
  • Dependency on commercial LLMs that may change behavior or access.
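
The first two failure modes above suggest validating every LLM reply before it reaches the planner. Below is a minimal sketch assuming the LLM is asked to reply with a JSON object. The parameter names `target_speed_mps` and `min_gap_m` and their bounds are hypothetical, chosen only for illustration, and are not PDM-Closed's actual parameters.

```python
import json
from typing import Dict, Optional

# Hypothetical parameter names and safe ranges (assumptions for this sketch).
BOUNDS = {"target_speed_mps": (0.0, 15.0), "min_gap_m": (0.5, 10.0)}


def parse_params(raw: str) -> Optional[Dict[str, float]]:
    """Return validated parameters, or None if the reply must be rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output, e.g. prose instead of JSON
    if not isinstance(data, dict) or set(data) != set(BOUNDS):
        return None  # missing or unexpected keys
    out: Dict[str, float] = {}
    for name, (lo, hi) in BOUNDS.items():
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None  # wrong type
        if not lo <= value <= hi:
            return None  # out of range (this also rejects NaN)
        out[name] = float(value)
    return out


print(parse_params('{"target_speed_mps": 8, "min_gap_m": 2.5}'))
print(parse_params("Sure! Here are the parameters you asked for."))
```

A `None` result would signal the caller to keep the base planner's default parameters, so a hallucinated or misformatted reply cannot change vehicle behavior.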

Core Entities

Models

  • PDM-Closed
  • IDM
  • GPT-3
  • GPT-4
  • Llama2-7B

Metrics

  • Score
  • Collisions
  • Time-to-Collision (TTC)
  • Drivable
  • Comfort
  • Progress
  • Speed Limit
  • Direction

Datasets

  • nuPlan val14

Benchmarks

  • nuPlan closed-loop non-reactive
  • nuPlan closed-loop reactive