Fuse object-level driving vectors into an LLM to explain and predict driving actions

Overview

Decision Snapshot: Needs Validation

The approach shows clear open-loop gains in simulated QA and action reasoning, but it is untested in closed-loop control, numerically imprecise for some regression outputs, and too slow for real-time deployment.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Grounding compact numeric scene vectors into an LLM yields interpretable, language-based explanations and improves action reasoning in simulation; this accelerates prototyping of explainable driving features but is not yet production-ready for closed-loop control.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The paper builds an LLM-based driving agent that reads compact object-level vectors (cars, pedestrians, ego state, route), converts them to structured text, and uses a frozen LLaMA-7b with LoRA adapters to answer driving questions and produce control commands. The authors release a 160k QA dataset (10k scenarios) generated using an RL expert and a GPT teacher, and show that two-stage training with a grounding pretraining stage improves perception and action prediction in simulation. They also highlight limits: open-loop evaluation only, numeric inaccuracies, and inference too slow for closed-loop control.

Problem Statement

Modern end-to-end driving models are hard to interpret and struggle with out-of-distribution reasoning. The paper asks: can we ground compact numeric object-level vectors into a pretrained LLM so the LLM can both explain scenarios in text and output control actions?

Main Contribution

A modular architecture that fuses object-level numeric vectors into a frozen LLM (LLaMA-7b) via vector encoders, a Vector Former, and LoRA adapters.

A driving dataset and auto-label pipeline: 10k simulated scenarios, 160k GPT-generated question-answer pairs, plus 100k pseudo-caption pairs for pretraining.
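As a rough illustration of the adapter idea named above, the sketch below applies a low-rank update to a frozen weight matrix in plain Python. Only the general LoRA form (a frozen weight plus a scaled product of two small trainable matrices) is standard. The matrix sizes, the `alpha` scaling value, and the helper names `matvec` and `lora_forward` are illustrative assumptions, not the paper's implementation.

```python
# Minimal LoRA-style forward pass in plain Python (illustrative sketch only).
# Shapes, values, and helper names are assumptions, not taken from the paper.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(frozen_w, lora_a, lora_b, x, alpha=1.0):
    """Return W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)                             # A has shape (r, d_in)
    base = matvec(frozen_w, x)                     # frozen path, never updated
    low_rank = matvec(lora_b, matvec(lora_a, x))   # B has shape (d_out, r)
    scale = alpha / rank
    return [b + scale * l for b, l in zip(base, low_rank)]

frozen_w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]     # d_out=2, d_in=3
lora_a = [[0.5, 0.5, 0.5]]                         # r=1
lora_b_zero = [[0.0], [0.0]]                       # common init: B starts at zero
x = [1.0, 2.0, 3.0]

base_out = matvec(frozen_w, x)
adapted_out = lora_forward(frozen_w, lora_a, lora_b_zero, x)
```

With `B` at zero the adapted output equals the frozen output, which is why such adapters can be attached to a pretrained model without changing its behavior before training starts.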

Key Findings

Pretraining the vector-to-language stage improves Driving QA scores.

Numbers: GPT score 8.39 vs 7.48 (10k finetune set; +0.91 absolute, i.e. 9.1% of the 10-point scale, or about +12.2% relative to 7.48)

Practical Use: Include a representation pretraining phase converting vectors to structured text before finetuning on QA/action pairs.

Evidence Ref: Table 2
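The reported score gain can be read in more than one way, so a small worked calculation may help. The sketch below uses only the two scores quoted in this finding and the 0-10 grading scale listed under Metrics. The variable names are illustrative, and the mapping of 8.39 to "with pretraining" and 7.48 to "without" is assumed from the finding's wording.

```python
# Reported GPT grading scores from the finding above (0-10 scale).
with_pretrain = 8.39       # assumed: score with the pretraining stage
without_pretrain = 7.48    # assumed: score without it
scale_max = 10.0

abs_gain = with_pretrain - without_pretrain          # gain in grading points
share_of_scale = abs_gain / scale_max * 100          # gain as % of the 10-point scale
relative_gain = abs_gain / without_pretrain * 100    # gain as % of the lower score

print(f"absolute: +{abs_gain:.2f}")
print(f"share of scale: {share_of_scale:.1f}%")
print(f"relative to baseline: {relative_gain:.1f}%")
```

The same 0.91-point gain is 9.1% of the grading scale but roughly 12.2% relative to the lower score, so it is worth stating which denominator a percentage uses.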

LLM-Driver (with pretraining) produces markedly lower action errors than a Perceiver-BC baseline.

Numbers: Longitudinal MAE 0.066 vs 0.180; lateral MAE 0.014 vs 0.111

Practical Use: Use pretrained LLM reasoning for action inference when you value interpretable, rule-like decisions over purely regression-based outputs.

Evidence Ref: Table 1
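For readers who want to reproduce this kind of comparison on their own data, the sketch below shows how a mean absolute error is computed and how the reported MAEs translate into relative error reductions. The sample steering values are hypothetical and only illustrate the formula. The function names are illustrative, and the four MAE figures are the ones quoted in this finding.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over paired samples."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)

def relative_reduction(model_error, baseline_error):
    """Fraction by which the model's error is below the baseline's."""
    return (baseline_error - model_error) / baseline_error

# Hypothetical normalized steering values, for illustration only.
example_mae = mean_absolute_error([0.10, -0.20, 0.00], [0.12, -0.25, 0.03])

# Reported MAEs from the finding above: LLM-Driver vs Perceiver-BC.
lon_reduction = relative_reduction(0.066, 0.180)
lat_reduction = relative_reduction(0.014, 0.111)

print(f"longitudinal error reduction: {lon_reduction:.1%}")
print(f"lateral error reduction: {lat_reduction:.1%}")
```

On the reported figures this works out to roughly a 63% lower longitudinal error and an 87% lower lateral error than the baseline.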

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
E_car (MAE count of cars)	0.066 (LLM-Driver w/ pretrain)	0.869 (Perceiver-BC)	−0.803	Evaluation set (1k scenarios)	Table 1 shows LLM-Driver w/ pretrain E_car 0.066, Perceiver-BC 0.869	Table 1
E_ped (MAE count of pedestrians)	0.313 (LLM-Driver w/ pretrain)	0.684 (Perceiver-BC)	−0.371	Evaluation set (1k scenarios)	Table 1 shows LLM-Driver w/ pretrain E_ped 0.313, Perceiver-BC 0.684	Table 1

What To Try In 7 Days

Run lanGen to convert your object-level vectors into structured text to inspect how scenarios read to an LLM.

Pretrain a small vector encoder by freezing an LLM and training on pseudo-caption pairs to align numeric tokens with language.

Fine-tune a small LLaMA+LoRA on a few hundred scenario QA pairs and compare token-decoded actions vs a regression baseline.
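Before running the released tooling, it can help to see what a vector-to-text conversion might look like in principle. The sketch below is not the lanGen tool from the repository. The `Agent` fields, the sign convention for angles, and the output wording are all hypothetical, chosen only to show how object-level numbers can be rendered as short structured sentences.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    kind: str          # "car" or "pedestrian" (hypothetical labels)
    distance_m: float  # distance from the ego vehicle
    angle_deg: float   # bearing relative to ego heading; positive = left (assumed)
    speed_mps: float

def describe_scene(ego_speed_mps: float, agents: List[Agent]) -> str:
    """Render object-level numbers as short, structured sentences."""
    cars = [a for a in agents if a.kind == "car"]
    peds = [a for a in agents if a.kind == "pedestrian"]
    lines = [
        f"Ego speed: {ego_speed_mps:.1f} m/s.",
        f"Observed: {len(cars)} car(s), {len(peds)} pedestrian(s).",
    ]
    # Nearest objects first so the most relevant ones lead the description.
    for a in sorted(agents, key=lambda a: a.distance_m):
        side = "left" if a.angle_deg > 0 else "right" if a.angle_deg < 0 else "ahead"
        lines.append(
            f"- {a.kind} at {a.distance_m:.1f} m, {abs(a.angle_deg):.0f} deg {side}, "
            f"moving at {a.speed_mps:.1f} m/s."
        )
    return "\n".join(lines)

scene_text = describe_scene(
    8.0,
    [Agent("pedestrian", 12.5, -10.0, 1.2), Agent("car", 6.0, 0.0, 7.5)],
)
print(scene_text)
```

Reading the generated text side by side with the raw vectors is a quick way to spot fields that are ambiguous or missing before any model is trained on them.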

Agent Features

Tool Use

LoRA, PPO, GPT-3.5 (teacher/grader)

Frameworks

Perceiver IO, LoRA, PPO

Architectures

LoRA, Vector Encoder + Vector Former, Perceiver-BC baseline

Optimization Features

Model Optimization

LoRA

Training Optimization

Two-stage: representation pretraining (freeze LLM) then end-to-end finetuning
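One way to picture the two-stage schedule is as a table of which parameter groups receive gradient updates in each stage. The sketch below is a framework-free illustration. The group names are hypothetical, and the stage-2 choice of training LoRA adapters rather than the base LLM weights is an assumption drawn from the summary's description of a frozen LLaMA-7b with LoRA adapters.

```python
# Which (hypothetical) parameter groups are trainable in each stage.
# Group names are illustrative and not taken from the released code.
STAGES = {
    "pretrain": {"vector_encoder": True, "vector_former": True,
                 "lora_adapters": False, "llm_base": False},
    "finetune": {"vector_encoder": True, "vector_former": True,
                 "lora_adapters": True, "llm_base": False},
}

def trainable_groups(stage: str) -> list:
    """Return the sorted names of parameter groups updated in a stage."""
    return sorted(name for name, on in STAGES[stage].items() if on)

for stage in ("pretrain", "finetune"):
    print(stage, "->", trainable_groups(stage))
```

In this reading the base LLM weights stay frozen throughout, and the second stage differs from the first only by also updating the adapters.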

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/wayveai/Driving-with-LLMs

Data URLs

https://github.com/wayveai/Driving-with-LLMs

Risks & Boundaries

Limitations

Evaluation is open-loop in simulation; closed-loop control performance is untested.

Numeric outputs can be imprecise (e.g., the traffic-light distance MAE is much larger than that of the regression baseline).

When Not To Use

When you need precise, low-latency closed-loop control in real vehicles.

When strict numeric accuracy (e.g., meter-level distances) is required.

Failure Modes

Hallucinated objects or explanations not grounded in vector input.

Poor numeric regression (large errors on distances) from token decoding.
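A simple guard against the first failure mode is to compare object counts stated in a generated answer with the counts present in the vector input. The sketch below is a hypothetical check, not something described in the paper. It only recognizes counts written as digits directly before a noun, and the function names and example answer are illustrative.

```python
import re

def extract_count(answer: str, noun: str):
    """Return the first integer that directly precedes `noun` in the answer, if any."""
    match = re.search(rf"(\d+)\s+{re.escape(noun)}s?\b", answer.lower())
    return int(match.group(1)) if match else None

def count_mismatches(answer: str, input_counts: dict) -> dict:
    """Compare counts stated in the answer with counts in the vector input."""
    mismatches = {}
    for noun, expected in input_counts.items():
        stated = extract_count(answer, noun)
        # Only flag a noun when the answer states a count that disagrees.
        if stated is not None and stated != expected:
            mismatches[noun] = (stated, expected)
    return mismatches

answer = "I see 3 cars and 1 pedestrian, so I am slowing down."
issues = count_mismatches(answer, {"car": 2, "pedestrian": 1})
print(issues)
```

Here the answer claims three cars while the input holds two, so the check flags `car` with the stated and expected counts. A production check would also need to handle spelled-out numbers and paraphrases.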

Core Entities

Models

LLaMA-7b, LoRA, Perceiver IO, Perceiver-BC

Metrics

E_car (MAE agents count); E_ped (MAE pedestrians count); Accuracy; D_TL (traffic light distance MAE in meters); E_lon (longitudinal MAE, normalized accel/brake); E_lat (lateral MAE, normalized steering); L_token (weighted token cross-entropy); GPT-3.5 grading (0-10); Human grading (0-10)

Datasets

Driving QA dataset (160k QA, 10k scenarios); Pretraining pseudo-caption dataset (100k pairs); RL expert trajectory data

Benchmarks

Driving QA (DQA)
