Use LLMs (LightGPT) to control traffic lights with human-like reasoning and lower deployment cost

December 26, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.85

Citation Count

10

Authors

Siqi Lai, Zhao Xu, Weijia Zhang, Hao Liu, Hui Xiong

Links

Abstract / PDF

Why It Matters For Business

LLMLight enables interpretable, generalizable traffic control with much lower deployment cost than closed LLM APIs, making city-scale experiments and phased rollouts affordable.

Summary TLDR

This paper turns large language models into traffic-signal agents. LLMLight converts local sensor counts into a text prompt, asks an LLM to reason step-by-step (chain-of-thought), and issues a signal phase. The authors build LightGPT by imitating GPT-4 trajectories and refining them with a critic model. On ten datasets (real + synthetic) and a 15-person expert review, LLMLight with LightGPT matches or beats state-of-the-art RL and heuristic methods while being far cheaper to run than closed models like GPT-4. Main limits: single-intersection view, no camera-image inputs, and no pedestrian/bicycle modeling.
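The verbalize-then-decide loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' prompt template: the phase names, lane names, the `<signal>` tag, and the functions `build_prompt` and `parse_phase` are all hypothetical, and the call to the LLM itself is left out.

```python
import re

# Hypothetical phase names and lane groupings; the source does not list them.
PHASES = {
    "NS_THROUGH": ["north_through", "south_through"],
    "EW_THROUGH": ["east_through", "west_through"],
    "NS_LEFT": ["north_left", "south_left"],
    "EW_LEFT": ["east_left", "west_left"],
}


def build_prompt(queued, approaching):
    """Turn per-lane vehicle counts into a plain-text prompt for the LLM."""
    lines = [
        "You control the traffic signal at one intersection.",
        "Current lane state, grouped by the phase that would release it:",
    ]
    for phase, lanes in PHASES.items():
        lines.append(f"Phase {phase}:")
        for lane in lanes:
            q = queued.get(lane, 0)
            a = approaching.get(lane, 0)
            lines.append(f"  - {lane}: {q} queued, {a} approaching")
    lines.append("Reason step by step about which phase relieves the most congestion.")
    lines.append("Finish with your choice in the form <signal>PHASE</signal>.")
    return "\n".join(lines)


def parse_phase(reply, default="NS_THROUGH"):
    """Extract the chosen phase; fall back to a default if the reply is malformed."""
    match = re.search(r"<signal>\s*(\w+)\s*</signal>", reply)
    if match and match.group(1) in PHASES:
        return match.group(1)
    return default


prompt = build_prompt({"north_through": 3, "east_left": 7}, {"north_through": 1})
chosen = parse_phase("East left turns have the longest queue. <signal>EW_LEFT</signal>")
```

The fallback in `parse_phase` is one simple way to keep the controller safe when the model fails to follow the output format.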

Problem Statement

Reinforcement-learning traffic controllers can be powerful but often fail to generalize, are hard to interpret, and require costly training. Off-the-shelf LLMs generalize and reason but lack traffic-specific data and can hallucinate. The paper asks: can an LLM be turned into an interpretable, generalizable, and cost-effective traffic-signal controller?

Main Contribution

LLMLight: a prompting workflow that verbalizes local traffic features and asks an LLM to pick a signal phase with chain-of-thought reasoning.

LightGPT: a TSC-specialized LLM trained by imitation fine-tuning on GPT-4 reasoning plus critic-guided policy refinement.

Extensive tests on ten datasets (seven real, three synthetic) showing competitive or superior ATT/AQL/AWT vs nine baselines and ten LLMs.

Prototype demo, human expert evaluation (15 experts), and a cost-effectiveness study showing much lower deployment cost than GPT-4.

Key Findings

LightGPT (Llama2-13B) yields low travel times on evaluated datasets.

Numbers: ATT ≈ 274.03 s on Jinan/Hangzhou (Table 2/8).

LLMLight maintains much lower waiting times than many RL methods when scaling to larger networks.

Numbers: RL methods' waiting times were 57.8% and 49.8% longer than LLMLight's in large-network tests (Figure 5).

Human experts rated the system's reasoning and decisions positively.

Numbers: Evaluated by 15 experts (4 traffic officers, 9 drivers, 2 AI specialists).

LightGPT is far cheaper to deploy than GPT-4 at scale.

Numbers: Annual cost for 100 lights: GPT-4 $1,680K vs LightGPT-13B $44.91K (Table 4).
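To put the cost figures above in per-light terms, the arithmetic can be worked through as below. Only the two annual totals and the 100-light count come from the summary; the variable and function names are illustrative.

```python
# Annual cost for 100 lights, as quoted from Table 4 above.
GPT4_ANNUAL_USD = 1_680_000      # $1,680K
LIGHTGPT_ANNUAL_USD = 44_910     # $44.91K
NUM_LIGHTS = 100


def per_light(total_usd, lights=NUM_LIGHTS):
    """Annual cost per traffic light."""
    return total_usd / lights


cost_ratio = GPT4_ANNUAL_USD / LIGHTGPT_ANNUAL_USD
savings_fraction = 1 - LIGHTGPT_ANNUAL_USD / GPT4_ANNUAL_USD

print(f"GPT-4 per light:    ${per_light(GPT4_ANNUAL_USD):,.2f}")
print(f"LightGPT per light: ${per_light(LIGHTGPT_ANNUAL_USD):,.2f}")
print(f"Cost ratio:         {cost_ratio:.1f}x")
print(f"Savings:            {savings_fraction:.1%}")
```

That works out to about $16,800 versus about $449 per light per year, a gap of roughly 37x.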

Results

Average Travel Time (ATT)

Value: 274.03 s

Baseline: GPT-4 ATT ≈ 275.26 s

Average Waiting Time (AWT)

Value: 43.24 s

Baseline: GPT-4 AWT ≈ 46.61 s

Transfer robustness (waiting time increase in RL baselines)

Value: 57.8% and 49.8% longer waits

Baseline: Best RL methods on large network

Who Should Care

What To Try In 7 Days

Run LLMLight in CityFlow on one intersection using local counts as text prompts to validate reasoning outputs.

Fine-tune an open LLM via LoRA on a small set of GPT-4 CoT trajectories and compare ATT/AWT vs your current controller.

Train a simple action-value critic from your simulator to filter/improve candidate LLM trajectories before deployment.
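The critic-filtering step in the last item might look something like the sketch below. It assumes a critic is already available as a function `q_value(state, action)` returning a score where higher is better; the record layout, the `tolerance` parameter, and the toy critic are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical record layout: {"state": ..., "action": str, "reasoning": str}
Trajectory = Dict[str, object]


def filter_by_critic(
    trajectories: List[Trajectory],
    q_value: Callable[[object, str], float],
    actions: Sequence[str],
    tolerance: float = 0.0,
) -> List[Trajectory]:
    """Keep trajectories whose chosen action is (near-)optimal under the critic."""
    kept = []
    for traj in trajectories:
        scores = {a: q_value(traj["state"], a) for a in actions}
        best = max(scores.values())
        if scores[traj["action"]] >= best - tolerance:
            kept.append(traj)
    return kept


# Toy critic for illustration only: score a phase by the queue it would release.
def toy_q(state, action):
    return float(state[action])


ACTIONS = ["NS_THROUGH", "EW_THROUGH"]
state = {"NS_THROUGH": 8, "EW_THROUGH": 2}
candidates = [
    {"state": state, "action": "NS_THROUGH", "reasoning": "North-south queue is longest."},
    {"state": state, "action": "EW_THROUGH", "reasoning": "East-west has waited longer."},
]
good = filter_by_critic(candidates, toy_q, ACTIONS)
```

In practice the critic would be trained from simulator rollouts, and a nonzero `tolerance` would trade data volume against label quality.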

Agent Features

Memory

  • No explicit long-term memory beyond the prompt (per-step, observation-based)

Planning

  • Chain-of-Thought reasoning (stepwise analysis)
  • Critic-guided ranking refinement of action trajectories

Tool Use

  • CityFlow simulator (for training and evaluation)

Frameworks

  • LLMLight prompting workflow
  • Imitation fine-tuning + ranking-based policy refinement (CGPR)

Is Agentic

true

Architectures

  • Large Language Model (LLM) agent per intersection
  • LightGPT backbone (fine-tuned LLM variants)

Collaboration

  • Single-agent (local observation only); multi-agent cooperation noted as future work

Optimization Features

Token Efficiency

  • Top_p=1.0; temperature=0 or 0.1 for stability (Section 4.4)

Infra Optimization

  • Real-time settings tuned for 10–20 parallel intersections per machine

Model Optimization

  • LoRA

System Optimization

  • Batch control of multiple intersections per machine (details in Appendix A.4)

Training Optimization

  • Imitation fine-tuning on GPT-4 reasoning
  • Filtering trajectories with an action-value critic
  • Ranking-loss-based policy refinement (RBC)
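As a rough illustration of ranking-based refinement, the sketch below uses a generic pairwise logistic ranking loss: whenever the critic scores one candidate trajectory above another, the model is penalized if it assigns the lower-scored one a higher log-probability. This is an assumed stand-in, not confirmed to be the exact loss used in the paper, and the function name and inputs are hypothetical.

```python
import math


def pairwise_ranking_loss(log_probs, critic_scores):
    """Mean softplus(p_j - p_i) over all pairs where the critic prefers i to j.

    log_probs:     model log-probability of each candidate trajectory
    critic_scores: critic score of each candidate (higher is better)
    """
    total, pairs = 0.0, 0
    n = len(log_probs)
    for i in range(n):
        for j in range(n):
            if critic_scores[i] > critic_scores[j]:
                # Small when the model already ranks i above j, large otherwise.
                total += math.log1p(math.exp(log_probs[j] - log_probs[i]))
                pairs += 1
    return total / pairs if pairs else 0.0


aligned = pairwise_ranking_loss([-1.0, -3.0], [2.0, 1.0])     # model agrees with critic
misaligned = pairwise_ranking_loss([-3.0, -1.0], [2.0, 1.0])  # model disagrees
```

A real training loop would compute this with a tensor library so gradients flow into the model; the plain-Python version only shows the shape of the objective.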

Inference Optimization

  • Use mid-sized models (Llama3-8B, Llama2-13B) for lower latency

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-intersection inputs only; no multi-agent coordination included.
  • No camera-image or multimodal inputs; relies on counts and other sensor features.
  • Does not model pedestrians, bicycles, or other non-vehicle actors.

When Not To Use

  • When global coordination across many intersections is required out of the box.
  • When decisions must incorporate camera vision or pedestrian safety signals.
  • Where regulatory constraints prohibit non-deterministic signal timing.

Failure Modes

  • LLM instruction-following failures or hallucinations (observed with ChatGPT-3.5).
  • Noisy or faulty sensor inputs produce incorrect textual summaries and, in turn, bad actions.
  • Single-intersection view can harm network-wide performance unless extended to multi-agent.

Core Entities

Models

  • LightGPT (Llama2-13B)
  • LightGPT (Llama3-8B)
  • LightGPT (Llama2-7B)
  • LightGPT (Qwen2-7B)
  • GPT-4
  • ChatGPT-3.5
  • Llama2-13B
  • Qwen2-0.5B
  • Qwen2-72B
  • Llama3-70B

Metrics

  • Average Travel Time (ATT)
  • Average Queue Length (AQL)
  • Average Waiting Time (AWT)
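For readers comparing against their own controller, the three metrics can be computed from simulator logs along these lines. The definitions below follow common usage in the traffic-signal-control literature and are assumptions here, as are the input formats and function names; check them against the paper's definitions before comparing numbers.

```python
from statistics import mean


def average_travel_time(trips):
    """ATT: mean seconds between entering and leaving the network.

    trips: list of (enter_time_s, exit_time_s) pairs, one per vehicle.
    """
    return mean(exit_t - enter_t for enter_t, exit_t in trips)


def average_queue_length(queue_counts):
    """AQL: mean number of queued vehicles over the sampled time steps."""
    return mean(queue_counts)


def average_waiting_time(wait_times):
    """AWT: mean seconds vehicles spend queued at intersections."""
    return mean(wait_times)


att = average_travel_time([(0, 100), (10, 210)])
aql = average_queue_length([2, 4, 6])
awt = average_waiting_time([30.0, 50.0])
```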

Datasets

  • Jinan (1/2/3/Extreme/24-hour)
  • Hangzhou (1/2/Extreme)
  • New York (1/2)
  • CityFlow (simulator)

Benchmarks

  • 10-dataset TSC benchmark (this work, mixed real+synthetic)