Fuse object-level driving vectors into an LLM to explain and predict driving actions

October 3, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach shows clear open-loop gains in simulated QA and action reasoning, but it is untested in closed-loop control, numerically imprecise on some regression targets, and too slow for real-time deployment.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Grounding compact numeric scene vectors into an LLM yields interpretable, language-based explanations and improves action reasoning in simulation; this accelerates prototyping of explainable driving features but is not yet production-ready for closed-loop control.

Who Should Care

Summary TLDR

The paper builds an LLM-based driving agent that reads compact object-level vectors (cars, pedestrians, ego state, route), converts them to structured text, and uses a frozen LLaMA-7b with LoRA adapters to answer driving questions and produce control commands. The authors release a 160k QA dataset (10k scenarios) generated with an RL expert and a GPT teacher, and show that two-stage grounding pretraining improves perception and action prediction in simulation. They also highlight the limits: open-loop evaluation only, numeric inaccuracies, and inference that is too slow for closed-loop control.

Problem Statement

Modern end-to-end driving models are hard to interpret and struggle with out-of-distribution reasoning. The paper asks: can we ground compact numeric object-level vectors into a pretrained LLM so the LLM can both explain scenarios in text and output control actions?

Main Contribution

A modular architecture that fuses object-level numeric vectors into a frozen LLM (LLaMA-7b) via vector encoders, a Vector Former, and LoRA adapters.

A driving dataset and auto-label pipeline: 10k simulated scenarios, 160k GPT-generated question-answer pairs, plus 100k pseudo-caption pairs for pretraining.
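The LoRA part of this design can be illustrated with a minimal pure-Python sketch. It assumes the standard LoRA formulation (a frozen weight matrix plus a trainable low-rank update scaled by alpha/r). The matrix sizes, values, and helper names are hypothetical and are not taken from the paper's code.

```python
# Minimal LoRA forward pass in pure Python (illustrative only).
# Assumption: standard LoRA, h = W x + (alpha / r) * B (A x), with W frozen.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """Frozen base projection plus a scaled low-rank update."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Hypothetical 2x3 frozen weight and rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]        # r x d_in, trainable
B_zero = [[0.0], [0.0]]      # d_out x r, zero-initialised: adapter starts as a no-op
B_trained = [[0.5], [-0.5]]  # made-up values standing in for a trained adapter

x = [1.0, 2.0, 3.0]
h_init = lora_forward(W, A, B_zero, x, alpha=2.0)
h_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)
```

With a zero-initialised B the output equals the frozen model's output, so training starts from the pretrained behaviour and only the small A and B matrices need gradients.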

Key Findings

Pretraining the vector-to-language stage improves Driving QA scores.

Numbers: GPT score 8.39 vs 7.48 (10k finetune set; +0.91 abs, which is +9.1% of the 10-point scale or about +12.2% relative to the baseline)

Practical Use: Include a representation pretraining phase that converts vectors to structured text before finetuning on QA/action pairs.

Evidence Ref: Table 2
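The size of this gain depends on the denominator used, and a short calculation makes the options explicit. The variable names are illustrative; the two scores are the ones quoted for this finding.

```python
# Express the GPT-score gain three ways (scores as quoted for this finding).
with_pretrain = 8.39
without_pretrain = 7.48
scale_max = 10.0  # GPT grading runs from 0 to 10

abs_gain = with_pretrain - without_pretrain
share_of_scale = abs_gain / scale_max * 100        # share of the 10-point scale
relative_gain = abs_gain / without_pretrain * 100  # relative to the baseline score

print(f"absolute: +{abs_gain:.2f}")
print(f"share of scale: +{share_of_scale:.1f}%")
print(f"relative to baseline: +{relative_gain:.1f}%")
```

The 9.1% figure matches the share-of-scale reading; measured against the 7.48 baseline the same gain is roughly 12%.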

LLM-Driver (with pretraining) produces markedly lower action errors than a Perceiver-BC baseline.

Numbers: Longitudinal MAE 0.066 vs 0.180; lateral MAE 0.014 vs 0.111

Practical Use: Use pretrained LLM reasoning for action inference when you value interpretable, rule-like decisions over purely regression-based outputs.

Evidence Ref: Table 1
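For readers who want to reproduce this kind of comparison on their own logs, the sketch below shows how a mean absolute error over control values can be computed and how the reported MAEs translate into relative reductions. The sample predictions are made up for illustration, and the helper names are hypothetical.

```python
# Mean absolute error (MAE) between predicted and expert control values.
# The sample values are made up for illustration; they are not from the paper.

def mae(predicted, expert):
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)

def relative_reduction(model_err, baseline_err):
    """Fraction by which the model's error is lower than the baseline's."""
    return (baseline_err - model_err) / baseline_err

# Hypothetical normalised steering predictions vs. expert labels.
example_mae = mae([0.0, 0.5, -0.25], [0.0, 0.25, -0.5])

# Reductions implied by the MAEs reported for this finding.
lon_reduction = relative_reduction(0.066, 0.180)  # about 0.63
lat_reduction = relative_reduction(0.014, 0.111)  # about 0.87
```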

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| E_car (MAE count of cars) | 0.066 (LLM-Driver w/ pretrain) | 0.869 (Perceiver-BC) | −0.803 | Evaluation set (1k scenarios) | Table 1 shows LLM-Driver w/ pretrain E_car 0.066, Perceiver-BC 0.869 | Table 1 |
| E_ped (MAE count of pedestrians) | 0.313 (LLM-Driver w/ pretrain) | 0.684 (Perceiver-BC) | −0.371 | Evaluation set (1k scenarios) | Table 1 shows LLM-Driver w/ pretrain E_ped 0.313, Perceiver-BC 0.684 | Table 1 |

What To Try In 7 Days

Run lanGen to convert your object-level vectors into structured text to inspect how scenarios read to an LLM.

Pretrain a small vector encoder by freezing an LLM and training on pseudo-caption pairs to align numeric tokens with language.

Fine-tune a small LLaMA+LoRA on a few hundred scenario QA pairs and compare token-decoded actions vs a regression baseline.
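As a starting point for the first step, the sketch below is a hypothetical stand-in for a vector-to-text converter. The field names, units, and sentence templates are assumptions chosen for illustration and do not reproduce the released lanGen implementation.

```python
# Illustrative vector-to-text conversion in the spirit of lanGen.
# Field names, units, and wording are assumptions, not the released format.
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    kind: str           # "car" or "pedestrian"
    distance_m: float   # distance from the ego vehicle, metres
    bearing_deg: float  # angle relative to ego heading; positive = left
    speed_mps: float

def describe_agent(agent: Agent) -> str:
    if agent.bearing_deg == 0:
        where = "directly ahead"
    else:
        side = "left" if agent.bearing_deg > 0 else "right"
        where = f"{abs(agent.bearing_deg):.0f} degrees to the {side}"
    return (f"A {agent.kind} is {agent.distance_m:.1f} m away, {where}, "
            f"moving at {agent.speed_mps:.1f} m/s.")

def describe_scene(ego_speed_mps: float, agents: List[Agent]) -> str:
    cars = sum(a.kind == "car" for a in agents)
    peds = sum(a.kind == "pedestrian" for a in agents)
    lines = [f"Ego speed: {ego_speed_mps:.1f} m/s. Cars: {cars}. Pedestrians: {peds}."]
    # Nearest objects first, so the most relevant ones lead the description.
    lines += [describe_agent(a) for a in sorted(agents, key=lambda a: a.distance_m)]
    return "\n".join(lines)

scene_text = describe_scene(8.0, [
    Agent("pedestrian", 12.5, -20.0, 1.2),
    Agent("car", 6.0, 0.0, 7.5),
])
print(scene_text)
```

Reading the generated text for a handful of scenarios is a cheap way to spot missing fields or ambiguous phrasing before any model training.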

Agent Features

Tool Use
LoRA, PPO, GPT-3.5 (teacher/grader)
Frameworks
Perceiver IO, LoRA, PPO
Architectures
LoRA, Vector Encoder + Vector Former, Perceiver-BC baseline

Optimization Features

Model Optimization
LoRA
Training Optimization
Two-stage: representation pretraining (freeze LLM) then end-to-end finetuning
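One way to picture the two-stage schedule is as a set of per-module trainable flags. The module names below are hypothetical labels, and the exact split of which modules train in each stage is an assumption inferred from this summary rather than the released configuration.

```python
# Two-stage schedule expressed as per-module "trainable" flags.
# Module names are hypothetical labels; which modules train in each stage
# is an assumption based on the summary, not the released config.

STAGES = {
    "representation_pretraining": {
        "vector_encoder": True,
        "vector_former": True,
        "llm_base_weights": False,  # LLM frozen
        "lora_adapters": False,
    },
    "finetuning": {
        "vector_encoder": True,
        "vector_former": True,
        "llm_base_weights": False,  # base weights stay frozen; LoRA carries the update
        "lora_adapters": True,
    },
}

def trainable_modules(stage: str) -> list:
    """Names of the modules that receive gradient updates in a stage."""
    return sorted(name for name, on in STAGES[stage].items() if on)

for stage in STAGES:
    print(stage, "->", trainable_modules(stage))
```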

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation is open-loop in simulation; closed-loop control performance is untested.

Numeric outputs can be imprecise (e.g., traffic-light distance MAE much larger than regression baseline).

When Not To Use

When you need precise, low-latency closed-loop control in real vehicles.

When strict numeric accuracy (e.g., meter-level distances) is required.

Failure Modes

Hallucinated objects or explanations not grounded in vector input.

Poor numeric regression (large errors on distances) from token decoding.
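A common mitigation for these failure modes is to validate token-decoded output before anything downstream consumes it. The sketch below assumes a hypothetical "name: value" output format and hypothetical value ranges; the paper's actual action format may differ.

```python
# Validate a token-decoded action string before using it downstream.
# The "name: value" output format and the value ranges are assumptions.
import re
from typing import Dict, Optional

RANGES = {"accel": (0.0, 1.0), "brake": (0.0, 1.0), "steer": (-1.0, 1.0)}
PATTERN = re.compile(r"(accel|brake|steer)\s*:\s*(-?\d+(?:\.\d+)?)")

def parse_action(text: str) -> Optional[Dict[str, float]]:
    """Return clamped controls, or None if any field is missing."""
    found = {name: float(value) for name, value in PATTERN.findall(text.lower())}
    if set(found) != set(RANGES):
        return None  # incomplete or malformed output: caller should fall back
    return {k: min(max(v, RANGES[k][0]), RANGES[k][1]) for k, v in found.items()}

ok = parse_action("Accel: 0.30, Brake: 0.0, Steer: -1.4")
bad = parse_action("I would slow down for the pedestrian.")
```

Clamping handles out-of-range numbers, while returning None for incomplete output lets the caller fall back to a conservative default instead of acting on free-form text.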

Core Entities

Models

LLaMA-7b, LoRA, Perceiver IO, Perceiver-BC

Metrics

E_car (MAE agents count)

E_ped (MAE pedestrians count)

Accuracy

D_TL (traffic light distance MAE in meters)

E_lon (longitudinal MAE, normalized accel/brake)

E_lat (lateral MAE, normalized steering)

L_token (weighted token cross-entropy)

GPT-3.5 grading (0-10)

Human grading (0-10)

Datasets

Driving QA dataset (160k QA, 10k scenarios)

Pretraining pseudo-caption dataset (100k pairs)

RL expert trajectory data

Benchmarks

Driving QA (DQA)