Fuse object-level driving vectors into an LLM to explain and predict driving actions

October 3, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

6

Authors

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton

Why It Matters For Business

Grounding compact numeric scene vectors into an LLM yields interpretable, language-based explanations and improves action reasoning in simulation; this accelerates prototyping of explainable driving features but is not yet production-ready for closed-loop control.

Summary TLDR

The paper builds an LLM-based driving agent that reads compact object-level vectors (cars, pedestrians, ego state, route), converts them to structured text, and uses a frozen LLaMA-7b with LoRA adapters to answer driving questions and produce control commands. The authors release a 160k QA dataset (10k scenarios) generated with an RL expert and a GPT teacher, and show that two-stage grounding pretraining improves perception and action prediction in simulation. They also highlight the limits: evaluation is open-loop only, numeric outputs are inaccurate, and inference is too slow for closed-loop control.
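
The vector-to-text step can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Vehicle` and `Pedestrian` fields, the bearing convention, and the sentence templates are hypothetical and are not taken from the paper's lanGen implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vehicle:
    # Hypothetical fields; the paper's actual vector layout is not given here.
    distance_m: float   # distance from the ego vehicle, in metres
    bearing_deg: float  # assumed convention: positive = left of ego heading
    speed_mps: float


@dataclass
class Pedestrian:
    distance_m: float
    bearing_deg: float


def side(bearing_deg: float) -> str:
    """Map a bearing to a coarse direction phrase (thresholds are arbitrary)."""
    if bearing_deg > 5:
        return "to the left"
    if bearing_deg < -5:
        return "to the right"
    return "ahead"


def describe_scene(ego_speed_mps: float,
                   vehicles: List[Vehicle],
                   pedestrians: List[Pedestrian]) -> str:
    """Render object-level numbers as short structured sentences."""
    lines = [
        f"Ego speed: {ego_speed_mps:.1f} m/s.",
        f"Vehicles: {len(vehicles)}. Pedestrians: {len(pedestrians)}.",
    ]
    # List the nearest objects first so the most relevant ones lead.
    for v in sorted(vehicles, key=lambda o: o.distance_m):
        lines.append(f"- Vehicle {v.distance_m:.1f} m {side(v.bearing_deg)}, "
                     f"moving at {v.speed_mps:.1f} m/s.")
    for p in sorted(pedestrians, key=lambda o: o.distance_m):
        lines.append(f"- Pedestrian {p.distance_m:.1f} m {side(p.bearing_deg)}.")
    return "\n".join(lines)


scene_text = describe_scene(
    8.0,
    [Vehicle(22.5, 0.0, 6.0), Vehicle(9.0, 30.0, 0.0)],
    [Pedestrian(14.2, -12.0)],
)
print(scene_text)
```

Reading the rendered text by eye is a cheap way to check whether the scene description carries enough information for a language model to answer questions about it.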

Problem Statement

Modern end-to-end driving models are hard to interpret and struggle with out-of-distribution reasoning. The paper asks: can we ground compact numeric object-level vectors into a pretrained LLM so the LLM can both explain scenarios in text and output control actions?

Main Contribution

A modular architecture that fuses object-level numeric vectors into a frozen LLM (LLaMA-7b) via vector encoders, a Vector Former, and LoRA adapters.

A driving dataset and auto-label pipeline: 10k simulated scenarios, 160k GPT-generated question-answer pairs, plus 100k pseudo-caption pairs for pretraining.

An evaluation protocol (Driving QA) using GPT-3.5 grading and a Perceiver-BC baseline; empirical results showing pretraining improves QA and action prediction in simulation.
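
The LoRA part of the architecture can be sketched in a few lines. This is a generic low-rank adapter in plain Python, assuming the standard LoRA formulation (frozen weight plus a scaled product of two small matrices); the class name, `rank`, and `alpha` values are illustrative, and which LLaMA layers the paper adapts is not stated in this summary.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 16.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, never updated
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the base layer's output unchanged.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]


layer = LoRALinear([[1.0, 0.0], [0.0, 2.0]], rank=1)
x = [3.0, 4.0]
print(layer.forward(x))  # equals the frozen layer's output while B is zero
```

Only `a` and `b` would receive gradients during fine-tuning, which is what keeps the number of trained parameters small relative to the 7B-parameter base model.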

Key Findings

Pretraining the vector-to-language stage improves Driving QA scores.

Numbers: GPT score 8.39 vs 7.48 (10k finetune set; +0.91 absolute, i.e. 9.1% of the 10-point scale, or about 12% relative to the baseline)

LLM-Driver (with pretraining) produces markedly lower action errors than a Perceiver-BC baseline.

Numbers: longitudinal MAE 0.066 vs 0.180; lateral MAE 0.014 vs 0.111

Perceiver-BC is stronger on simple numeric regressions like traffic-light distance.

Numbers: traffic-light distance MAE 0.410 m (Perceiver-BC) vs 6.624 m (LLM-Driver)

A large auto-labelled dataset was created from simulation plus GPT supervision.

Numbers: 160k QA pairs across 10k scenarios; 100k pseudo-caption pairs for pretraining

Results

E_car (MAE count of cars)

Value: 0.066 (LLM-Driver w/ pretrain)

Baseline: 0.869 (Perceiver-BC)

E_ped (MAE count of pedestrians)

Value: 0.313 (LLM-Driver w/ pretrain)

Baseline: 0.684 (Perceiver-BC)

E_lon (normalized longitudinal MAE)

Value: 0.066 (LLM-Driver w/ pretrain)

Baseline: 0.180 (Perceiver-BC)

D_TL (traffic light distance MAE, meters)

Value: 6.624 m (LLM-Driver w/ pretrain)

Baseline: 0.410 m (Perceiver-BC)

Driving QA grade (GPT-3.5 average, 0-10)

Value: 8.39 (LLM-Driver w/ pretrain)

Baseline: 7.48 (LLM-Driver w/o pretrain)

Driving QA grade (Human average, 0-10)

Value: 7.71 (LLM-Driver w/ pretrain)

Baseline: 6.63 (LLM-Driver w/o pretrain)
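
To put the reported values on a common footing, the relative change against each baseline can be computed directly. The numbers below are copied from the results above; the helper name `relative_change` is just for illustration.

```python
# (value, baseline) pairs as reported above. Lower is better for the MAE metrics.
reported_errors = {
    "E_car": (0.066, 0.869),
    "E_ped": (0.313, 0.684),
    "E_lon": (0.066, 0.180),
    "D_TL": (6.624, 0.410),
}

# Higher is better for the 0-10 QA grades.
reported_grades = {
    "GPT-3.5 grade": (8.39, 7.48),
    "Human grade": (7.71, 6.63),
}


def relative_change(value: float, baseline: float) -> float:
    """Signed change of value relative to baseline, as a fraction."""
    return (value - baseline) / baseline


for name, (value, baseline) in {**reported_errors, **reported_grades}.items():
    print(f"{name}: {relative_change(value, baseline):+.1%}")
```

A negative change on the error metrics means the LLM-Driver improves on the baseline; the large positive change on D_TL reflects the traffic-light distance regression where Perceiver-BC is stronger.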

What To Try In 7 Days

Run lanGen to convert your object-level vectors into structured text to inspect how scenarios read to an LLM.

Pretrain a small vector encoder by freezing an LLM and training on pseudo-caption pairs to align numeric tokens with language.

Fine-tune a small LLaMA+LoRA on a few hundred scenario QA pairs and compare token-decoded actions vs a regression baseline.
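
For the last experiment, the comparison needs a way to turn generated text back into numbers. The sketch below assumes a hypothetical answer format (`accelerator: ..., brake: ..., steering: ...`) and treats longitudinal control as accelerator minus brake; both are assumptions for illustration, not the paper's decoding scheme.

```python
import re
from typing import Dict, List, Optional

# Hypothetical output format: "accelerator: 0.20, brake: 0.00, steering: -0.10"
ACTION_PATTERN = re.compile(
    r"(accelerator|brake|steering)\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE
)
REQUIRED = {"accelerator", "brake", "steering"}


def parse_action(text: str) -> Optional[Dict[str, float]]:
    """Extract the three control values, or None if any is missing."""
    found = {name.lower(): float(value) for name, value in ACTION_PATTERN.findall(text)}
    return found if set(found) == REQUIRED else None


def action_mae(pred_texts: List[str], targets: List[Dict[str, float]]) -> Dict[str, Optional[float]]:
    """Longitudinal and lateral MAE over parseable answers, plus the failure count."""
    lon_errors, lat_errors, failures = [], [], 0
    for text, target in zip(pred_texts, targets):
        action = parse_action(text)
        if action is None:
            failures += 1  # count unparseable answers instead of hiding them
            continue
        # Assumption: longitudinal control = accelerator - brake.
        lon_pred = action["accelerator"] - action["brake"]
        lon_true = target["accelerator"] - target["brake"]
        lon_errors.append(abs(lon_pred - lon_true))
        lat_errors.append(abs(action["steering"] - target["steering"]))
    n = len(lon_errors)
    return {
        "lon_mae": sum(lon_errors) / n if n else None,
        "lat_mae": sum(lat_errors) / n if n else None,
        "parse_failures": failures,
    }


preds = ["Accelerator: 0.20, brake: 0.00, steering: -0.10", "I would slow down."]
truth = [
    {"accelerator": 0.30, "brake": 0.0, "steering": -0.05},
    {"accelerator": 0.00, "brake": 0.4, "steering": 0.00},
]
report = action_mae(preds, truth)
print(report)
```

Reporting parse failures alongside the MAE matters here: a regression baseline always emits numbers, while a text-decoding model can silently fail to produce a usable action.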

Agent Features

Tool Use

  • LoRA
  • PPO
  • GPT-3.5 (teacher/grader)

Frameworks

  • Perceiver IO
  • LoRA
  • PPO

Architectures

  • LoRA
  • Vector Encoder + Vector Former
  • Perceiver-BC baseline

Optimization Features

Model Optimization

  • LoRA

Training Optimization

  • Two-stage: representation pretraining (freeze LLM) then end-to-end finetuning
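
One way to read the two-stage schedule is as a rule for which parameter groups receive gradients in each stage. The group names below are hypothetical labels, and treating the LoRA adapters as trainable only in the second stage (with the base LLM weights frozen throughout) is an interpretation of this summary, not a confirmed detail of the paper's training code.

```python
from typing import Dict

# Hypothetical parameter-group labels for illustration.
GROUPS = ("vector_encoder", "vector_former", "lora_adapters", "llm_base")


def trainable_groups(stage: int) -> Dict[str, bool]:
    """Return which parameter groups would be updated in a given stage."""
    if stage == 1:
        # Representation pretraining: the LLM is frozen, only the vector side learns.
        trainable = {"vector_encoder", "vector_former"}
    elif stage == 2:
        # Fine-tuning: adapters join in, the base LLM weights stay frozen.
        trainable = {"vector_encoder", "vector_former", "lora_adapters"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {group: group in trainable for group in GROUPS}


for stage in (1, 2):
    print(stage, trainable_groups(stage))
```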

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation is open-loop in simulation; closed-loop control performance is untested.
  • Numeric outputs can be imprecise (e.g., traffic-light distance MAE of 6.624 m vs 0.410 m for the Perceiver-BC regression baseline).
  • Inference cost and latency of LLMs make real-time control impractical as-is.
  • Dataset labels are auto-generated with GPT and an RL expert; this can carry teacher-model biases.

When Not To Use

  • When you need precise, low-latency closed-loop control in real vehicles.
  • When strict numeric accuracy (e.g., meter-level distances) is required.
  • When model inference cost or latency must be minimal.

Failure Modes

  • Hallucinated objects or explanations not grounded in vector input.
  • Poor numeric regression (large errors on distances) from token decoding.
  • Slow inference leading to missed control windows.
  • Mismatch between open-loop answers and closed-loop behavior.

Core Entities

Models

  • LLaMA-7b
  • LoRA
  • Perceiver IO
  • Perceiver-BC

Metrics

  • E_car (MAE count of cars)
  • E_ped (MAE count of pedestrians)
  • Accuracy
  • D_TL (traffic light distance MAE in meters)
  • E_lon (longitudinal MAE, normalized accel/brake)
  • E_lat (lateral MAE, normalized steering)
  • L_token (weighted token cross-entropy)
  • GPT-3.5 grading (0-10)
  • Human grading (0-10)
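
As an illustration of the L_token metric, a weighted token cross-entropy can be sketched as follows. The per-token weighting scheme used in the paper is not given in this summary, so the `weights` argument here is a generic assumption: any non-negative weight per target token.

```python
import math
from typing import List


def weighted_token_cross_entropy(logits: List[List[float]],
                                 targets: List[int],
                                 weights: List[float]) -> float:
    """Weighted mean of per-token cross-entropy, computed from raw logits."""
    total, weight_sum = 0.0, 0.0
    for row, target, weight in zip(logits, targets, weights):
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += weight * (log_z - row[target])
        weight_sum += weight
    return total / weight_sum


# Two tokens over a 4-word vocabulary; the second token is masked out by weight 0.
example_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, -5.0, 0.0, 0.0]]
example_loss = weighted_token_cross_entropy(example_logits, [2, 1], [1.0, 0.0])
print(example_loss)  # uniform logits over 4 classes give log(4)
```

Setting a weight to zero removes that token from the average, which is one common way to score only the answer tokens that matter for a metric.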

Datasets

  • Driving QA dataset (160k QA, 10k scenarios)
  • Pretraining pseudo-caption dataset (100k pairs)
  • RL expert trajectory data

Benchmarks

  • Driving QA (DQA)