How large language models (LLMs) are being used to coordinate, plan, and control teams of robots

February 6, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The survey compiles diverse early-stage systems and benchmarks; evidence is broad but mostly simulation and prototype experiments, so production readiness is limited without hybrid verification and latency fixes.

Citations: 4

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Peihan Li, Zijian An, Shams Abrar, Lifeng Zhou

Why It Matters For Business

LLMs can speed up multi-robot coordination and simplify human instructions, but current limitations (math errors, hallucinations, latency) mean companies should pilot hybrid systems that pair LLMs for planning with verified controllers for execution.

Summary TLDR

This survey reviews how large language models (LLMs) are being applied to multi-robot systems (MRS). It organizes work into four levels: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention. The paper catalogs communication architectures (centralized, decentralized, hybrid), multimodal extensions (VLMs, VLAs), common simulators and benchmarks (AI2-THOR, PyBullet, RoCoBench, BOLAA, COHERENT), and practical challenges: weak mathematical reasoning, hallucination, latency, multi-modal fusion, and sparse standardized benchmarks. It ends with concrete opportunities: fine-tuning/LoRA, RAG, lightweight task-specific models, and richer multi-modal

Problem Statement

Integrating LLMs into real multi-robot teams promises easier instruction, dynamic task allocation, and richer human-robot interaction. However, MRS impose unique constraints (coordination, real-time behavior, heterogeneous robot bodies, and field deployment) that current LLM methods struggle with because of reasoning gaps, hallucination, latency, and weak benchmarks.

Main Contribution

First focused survey of LLM use specifically for multi-robot systems (MRS).

A clear taxonomy: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention.

Key Findings

LLMs are being used at four operational levels in MRS: task allocation, motion planning, action generation, and human-in-the-loop.

Practical Use: Use LLMs for high-level decomposition and coordination, but pair them with controllers or planners for low-level, safety-critical control.

Evidence Ref: Abstract; Sec. 4
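
The split described in this finding can be illustrated with a minimal sketch. It is not taken from the survey: the robot names, skills, the `fake_llm_allocate` stub (a stand-in for a real LLM call), and the simple proportional controller are all assumptions chosen for illustration. The stubbed "LLM" proposes a task allocation as JSON, a validator rejects tasks a robot cannot perform, and a deterministic controller handles the low-level motion.

```python
import json

# Hypothetical robot capabilities; names and skills are illustrative only.
ROBOT_SKILLS = {
    "arm_1": {"pick", "place"},
    "rover_1": {"navigate", "scan"},
}

def fake_llm_allocate(instruction: str) -> str:
    """Stand-in for an LLM call; returns a JSON task allocation."""
    return json.dumps([
        {"robot": "rover_1", "skill": "navigate", "target": [2.0, 1.0]},
        {"robot": "arm_1", "skill": "pick", "target": [2.0, 1.0]},
        # Deliberately infeasible task to exercise the validator.
        {"robot": "arm_1", "skill": "fly", "target": [0.0, 0.0]},
    ])

def validate(plan_json: str):
    """Split proposed tasks into accepted and rejected using known skills."""
    accepted, rejected = [], []
    for task in json.loads(plan_json):
        skills = ROBOT_SKILLS.get(task.get("robot"), set())
        (accepted if task.get("skill") in skills else rejected).append(task)
    return accepted, rejected

def p_controller_step(position, target, gain=0.5, max_step=0.2):
    """One proportional-control step toward the target, clamped per axis."""
    new_pos = []
    for p, t in zip(position, target):
        delta = max(-max_step, min(max_step, gain * (t - p)))
        new_pos.append(p + delta)
    return new_pos

def run_navigation(start, target, tol=0.01, max_iters=200):
    """Iterate the controller until every axis is within tol of the target."""
    pos = list(start)
    for _ in range(max_iters):
        if all(abs(t - p) <= tol for p, t in zip(pos, target)):
            break
        pos = p_controller_step(pos, target)
    return pos

accepted, rejected = validate(fake_llm_allocate("fetch the box at (2, 1)"))
final = run_navigation([0.0, 0.0], accepted[0]["target"])
print("accepted:", len(accepted), "rejected:", len(rejected), "final:", final)
```

The language model never touches the control loop here: it only proposes who does what, and anything it proposes outside the known skill set is dropped before execution.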

LLMs show large failures on mathematical/logical reasoning tasks; performance can drop markedly when problem clauses change.

Numbers: up to 65% performance drop reported

Practical Use: Avoid relying on raw LLM outputs for precise numeric planning; use symbolic solvers, verification layers, or hybrid pipelines.

Evidence Ref: [87] (Mirzadeh et al.), Sec. 7.1
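
A verification layer of the kind suggested in this finding can be quite small. The sketch below is an assumption-laden example, not a method from the survey: robot and task coordinates are made up, and `llm_assignment` / `llm_claimed_cost` stand in for values parsed from a hypothetical LLM reply. The code recomputes the travel cost deterministically and compares the proposal against a brute-force optimum, which is feasible only for small teams.

```python
from itertools import permutations
from math import dist, isclose

# Illustrative positions (assumed for this example).
robots = {"r1": (0.0, 0.0), "r2": (4.0, 0.0), "r3": (0.0, 3.0)}
tasks = {"t1": (4.0, 1.0), "t2": (0.0, 4.0), "t3": (1.0, 0.0)}

def total_cost(assignment):
    """Sum of straight-line distances for a robot -> task assignment."""
    return sum(dist(robots[r], tasks[t]) for r, t in assignment.items())

def optimal_assignment():
    """Brute-force the cheapest one-to-one assignment (fine for small teams)."""
    names = list(robots)
    best = min(
        permutations(tasks),
        key=lambda perm: total_cost(dict(zip(names, perm))),
    )
    return dict(zip(names, best))

def check_claim(assignment, claimed_cost, tol=1e-6):
    """Return (claim_is_correct, recomputed_cost)."""
    actual = total_cost(assignment)
    return isclose(actual, claimed_cost, abs_tol=tol), actual

# Hypothetical LLM output: a sensible assignment with a wrong arithmetic total.
llm_assignment = {"r1": "t3", "r2": "t1", "r3": "t2"}
llm_claimed_cost = 2.5

ok, actual = check_claim(llm_assignment, llm_claimed_cost)
is_optimal = llm_assignment == optimal_assignment()
print("claim ok:", ok, "recomputed cost:", actual, "optimal:", is_optimal)
```

In this example the assignment itself is acceptable but the stated total is wrong, so downstream code should use the recomputed number rather than the one quoted in the model's reply.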

What To Try In 7 Days

Run a proof-of-concept: use an LLM for high-level task allocation and a traditional planner for low-level control in simulation.

Measure latency and token costs with centralized vs hybrid communication on a small team (3–6 robots).

Test LoRA fine-tuning on a small domain corpus and compare hallucination rates with/without RAG retrieval.
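
For the latency and token measurement step, a small logging harness may be enough to start. The sketch below is an assumption, not part of the survey: `stub_planner` replaces a real LLM call, whitespace word counts serve as a crude token proxy, and the two configurations (one call for the whole team versus one call per robot) are simplified stand-ins rather than the communication architectures defined in the surveyed papers.

```python
import time
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CallLog:
    """Per-configuration record of latency and rough token counts."""
    latencies_s: list = field(default_factory=list)
    prompt_tokens: list = field(default_factory=list)
    reply_tokens: list = field(default_factory=list)

    def summary(self):
        return {
            "calls": len(self.latencies_s),
            "mean_latency_s": mean(self.latencies_s) if self.latencies_s else 0.0,
            "total_tokens": sum(self.prompt_tokens) + sum(self.reply_tokens),
        }

def rough_tokens(text: str) -> int:
    """Whitespace word count; swap in the model's tokenizer for real costs."""
    return len(text.split())

def timed_call(planner, prompt, log):
    """Call the planner once and record latency and token proxies."""
    start = time.perf_counter()
    reply = planner(prompt)
    log.latencies_s.append(time.perf_counter() - start)
    log.prompt_tokens.append(rough_tokens(prompt))
    log.reply_tokens.append(rough_tokens(reply))
    return reply

def stub_planner(prompt: str) -> str:
    """Hypothetical planner; replace with a real LLM client call."""
    return "move robot to goal"

robot_states = [f"robot {i} at cell {i}" for i in range(4)]

# Configuration A: one call that sees every robot's state.
central_log = CallLog()
timed_call(stub_planner, " ; ".join(robot_states), central_log)

# Configuration B: one call per robot.
per_robot_log = CallLog()
for state in robot_states:
    timed_call(stub_planner, state, per_robot_log)

print("single call:", central_log.summary())
print("per robot:  ", per_robot_log.summary())
```

With a real model behind `stub_planner`, the same logs give calls, mean latency, and total tokens per configuration, which is the comparison the pilot needs.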

Agent Features

Memory
short-term session memory, retrospective/long-term memory
Planning
task allocation, motion planning, action generation, human-in-the-loop
Tool Use
LoRA, RAG (retrieval-augmented generation), VLMs (vision-language models), VLAs (vision-language-action models)
Frameworks
EMOS, RoCo, LLM-Flock, DART-LLM, GenSwarm, BOLAA
Is Agentic

Yes

Architectures
centralized, decentralized, hybrid, hierarchical
Collaboration
inter-agent dialogue, central planner coordination, iterative proposal-feedback loops

Optimization Features

Token Efficiency
centralized CMAS is token-efficient (reported), prompt size reduction with RAG
Infra Optimization
onboard inference hardware for remote/field robots
Model Optimization
LoRA, model distillation
System Optimization
hybrid architectures to trade tokens vs steps
Training Optimization
synthetic dataset generation, task-specific fine-tuning
Inference Optimization
use smaller task-specific models, local deployment to cut latency

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Weak mathematical and numerical reasoning in LLMs for planning

Prone to hallucination; needs verification and RAG

When Not To Use

Time-critical low-level control loops requiring sub-second response

Precise numerical optimization or trajectory planning without symbolic solvers

Failure Modes

Hallucinated plan leads to unsafe or infeasible robot actions

High latency causes missed control deadlines and mission failure
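
One common guard against the latency failure mode is a deadline with a safe fallback. The sketch below is illustrative only: `slow_planner` and `fast_planner` are hypothetical stand-ins for an LLM-backed planner, and `hold_position` is an assumed safe action. If the planner does not answer within the deadline, the control loop falls back instead of waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

SAFE_ACTION = "hold_position"  # assumed safe default for this example

def slow_planner(observation):
    """Hypothetical planner that exceeds the deadline."""
    time.sleep(0.5)
    return "move_forward"

def fast_planner(observation):
    """Hypothetical planner that answers immediately."""
    return "move_forward"

def act_with_deadline(planner, observation, deadline_s, executor):
    """Return the planner's action, or SAFE_ACTION if the deadline passes."""
    future = executor.submit(planner, observation)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return SAFE_ACTION

with ThreadPoolExecutor(max_workers=2) as pool:
    late = act_with_deadline(slow_planner, {"x": 0}, 0.1, pool)
    on_time = act_with_deadline(fast_planner, {"x": 0}, 0.1, pool)

print("late planner ->", late)
print("fast planner ->", on_time)
```

The timeout does not stop the slow call from consuming resources; it only keeps the control loop from blocking on it, so a real system would also need to discard the stale reply when it eventually arrives.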

Core Entities

Models

GPT-4, GPT-3.5 Turbo, Llama 3.1, Claude (Anthropic), DeepSeek-R1, Qwen-2.5, PaLI, CLIP

Metrics

task success rate, token efficiency, latency (s/step), precision/robustness in manipulation

Datasets

MultiPlan, BEHAVIOR-1K, ALFRED

Benchmarks

RoCoBench, BOLAA, COHERENT-Benchmark, RoCoBench (human-robot manipulation)

Context Entities

Models

SmolVLM, Moondream 2B, PaliGemma-2 3B, Qwen2-VL 2B

Metrics

real-world deployment success, communication steps, replanning frequency

Datasets

Multi-robot simulation scenarios (BoxNet, BoxLift, warehouse); task-specific synthetic datasets

Benchmarks

Task-specific evaluation suites used by surveyed papers