How large language models (LLMs) are being used to coordinate, plan, and control teams of robots

Overview

Decision Snapshot: Needs Validation

The survey compiles diverse early-stage systems and benchmarks; evidence is broad but mostly simulation and prototype experiments, so production readiness is limited without hybrid verification and latency fixes.

Citations: 4

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Peihan Li, Zijian An, Shams Abrar, Lifeng Zhou

Links

Abstract / PDF

Why It Matters For Business

LLMs can speed up multi-robot coordination and simplify human instructions, but current limitations (math errors, hallucinations, latency) mean companies should pilot hybrid systems that pair LLMs for planning with verified controllers for execution.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This survey reviews how large language models (LLMs) are being applied to multi-robot systems (MRS). It organizes work into four levels: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention. The paper catalogs communication architectures (centralized, decentralized, hybrid), multimodal extensions (VLMs, VLAs), common simulators and benchmarks (AI2-THOR, PyBullet, RoCoBench, BOLAA, COHERENT), and practical challenges: weak mathematical reasoning, hallucination, latency, multi-modal fusion, and sparse standardized benchmarks. It ends with concrete opportunities: fine-tuning/LoRA, RAG, lightweight task-specific models, and richer multi-modal

Problem Statement

Integrating LLMs into real multi-robot teams promises easier instruction, dynamic task allocation, and richer human-robot interaction. However, MRS impose unique constraints (coordination, real-time behavior, heterogeneous robot bodies, and field deployment) that current LLM methods struggle to meet because of reasoning gaps, hallucination, latency, and weak benchmarks.

Main Contribution

First focused survey of LLM use specifically for multi-robot systems (MRS).

A clear taxonomy: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention.

Key Findings

LLMs are being used at four operational levels in MRS: task allocation, motion planning, action generation, and human-in-the-loop.

Practical Use: Use LLMs for high-level decomposition and coordination, but pair them with controllers or planners for low-level, safety-critical control.

Evidence Ref: Abstract; Sec. 4
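
The pairing described in the practical-use note above can be sketched as a validation gate between the LLM and the controller. This is a minimal illustration, not a method from the survey: the capability table, robot names, skill names, and the JSON plan format are all hypothetical, and the model response is a hard-coded string standing in for a real LLM call.

```python
import json

# Hypothetical capability table; robot and skill names are illustrative only.
CAPABILITIES = {
    "arm_1": {"pick", "place"},
    "rover_1": {"navigate", "scan"},
}

def validate_allocation(raw_llm_output: str) -> list[dict]:
    """Parse an LLM-proposed allocation and reject steps outside CAPABILITIES."""
    steps = json.loads(raw_llm_output)
    accepted = []
    for step in steps:
        robot, skill = step.get("robot"), step.get("skill")
        if robot not in CAPABILITIES:
            raise ValueError(f"unknown robot: {robot!r}")
        if skill not in CAPABILITIES[robot]:
            raise ValueError(f"{robot} cannot perform {skill!r}")
        accepted.append({"robot": robot, "skill": skill, "target": step.get("target")})
    return accepted

# Stand-in for a model response (assumed JSON format, not a real API reply).
llm_output = (
    '[{"robot": "rover_1", "skill": "navigate", "target": "shelf_A"},'
    ' {"robot": "arm_1", "skill": "pick", "target": "box_3"}]'
)
plan = validate_allocation(llm_output)
```

Only steps that pass the gate would be handed to a conventional planner or controller; a rejected plan can be sent back to the LLM with the error message as feedback.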

LLMs show large failures on mathematical/logical reasoning tasks; performance can drop markedly when problem clauses change.

Numbers: up to 65% performance drop reported

Practical Use: Avoid relying on raw LLM outputs for precise numeric planning; use symbolic solvers, verification layers, or hybrid pipelines.

Evidence Ref: [87] (Mirzadeh et al.); Sec. 7.1
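
One way to read the verification-layer advice above: treat any number an LLM reports as a claim to be recomputed. The sketch below assumes a hypothetical 3-robot, 3-task cost matrix and a hypothetical LLM proposal with a claimed total cost; it recomputes the cost and compares it with a brute-force optimum. None of the values come from the survey.

```python
from itertools import permutations

# Hypothetical cost matrix: COST[i][j] is the travel cost for robot i to reach task j.
COST = [
    [4.0, 2.0, 8.0],
    [3.0, 7.0, 5.0],
    [6.0, 1.0, 9.0],
]

def total_cost(assignment):
    """Sum the cost of giving task assignment[i] to robot i."""
    return sum(COST[robot][task] for robot, task in enumerate(assignment))

def optimal_assignment():
    """Exhaustive search; only practical for small teams."""
    return min(permutations(range(len(COST))), key=total_cost)

def check_proposal(assignment, claimed_cost, tol=1e-6):
    """Verify that a proposal is a valid one-to-one assignment and that its claimed cost is right."""
    valid = sorted(assignment) == list(range(len(COST)))
    actual = total_cost(assignment) if valid else None
    best = optimal_assignment()
    return {
        "valid": valid,
        "arithmetic_ok": valid and abs(actual - claimed_cost) <= tol,
        "actual_cost": actual,
        "optimal": list(best),
        "optimal_cost": total_cost(best),
    }

# Hypothetical LLM output: robot 0 -> task 1, robot 1 -> task 0, robot 2 -> task 2,
# with a claimed total of 13.0 (the recomputed total is 14.0).
report = check_proposal([1, 0, 2], claimed_cost=13.0)
```

Here the check flags both an arithmetic error and a suboptimal allocation. For larger teams, exhaustive search should be swapped for a proper assignment solver such as SciPy's `linear_sum_assignment`.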

What To Try In 7 Days

Run a proof-of-concept: use an LLM for high-level task allocation and a traditional planner for low-level control in simulation.

Measure latency and token costs with centralized vs hybrid communication on a small team (3–6 robots).

Test LoRA fine-tuning on a small domain corpus and compare hallucination rates with/without RAG retrieval.
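
For the latency and token measurement above, a small logging wrapper is enough to start comparing communication patterns. This is a sketch under stated assumptions: `fake_planner` is a placeholder for a real LLM call, whitespace splitting is only a crude proxy for a real tokenizer, and the two patterns shown (one call for the whole team versus one call per robot) are simplified stand-ins for the centralized and hybrid setups the survey discusses.

```python
import time
from statistics import mean

def rough_token_count(text: str) -> int:
    # Whitespace split is a crude proxy; a real tokenizer will give different counts.
    return len(text.split())

class CallLog:
    """Records latency and approximate token usage for each planner call."""

    def __init__(self):
        self.records = []

    def timed_call(self, planner, prompt):
        start = time.perf_counter()
        reply = planner(prompt)
        elapsed = time.perf_counter() - start
        self.records.append({
            "latency_s": elapsed,
            "tokens": rough_token_count(prompt) + rough_token_count(reply),
        })
        return reply

    def summary(self):
        return {
            "calls": len(self.records),
            "mean_latency_s": mean(r["latency_s"] for r in self.records),
            "total_tokens": sum(r["tokens"] for r in self.records),
        }

def fake_planner(prompt: str) -> str:
    # Placeholder for a real model call; always returns the same reply.
    return "rover_1 navigate shelf_A"

robots = ["r1", "r2", "r3"]

# Pattern 1: a single call plans for the whole team.
centralized = CallLog()
centralized.timed_call(fake_planner, "Allocate tasks for robots: " + ", ".join(robots))

# Pattern 2: one call per robot.
per_robot = CallLog()
for name in robots:
    per_robot.timed_call(fake_planner, f"Plan next step for {name}")
```

Calling `centralized.summary()` and `per_robot.summary()` gives call counts, mean latency, and approximate token totals that can be tabulated per team size.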

Agent Features

Memory

short-term session memory, retrospective/long-term memory

Planning

task allocation, motion planning, action generation, human-in-the-loop

Tool Use

LoRA, RAG (retrieval-augmented generation), VLMs (vision-language models), VLAs (vision-language-action models)

Frameworks

EMOS, RoCo, LLM-Flock, DART-LLM, GenSwarm, BOLAA

Is Agentic

Yes

Architectures

centralized, decentralized, hybrid, hierarchical

Collaboration

inter-agent dialogue, central planner coordination, iterative proposal-feedback loops

Optimization Features

Token Efficiency

centralized CMAS is token-efficient (reported), prompt size reduction with RAG

Infra Optimization

onboard inference hardware for remote/field robots

Model Optimization

LoRA, model distillation

System Optimization

hybrid architectures to trade tokens vs steps

Training Optimization

synthetic dataset generation, task-specific fine-tuning

Inference Optimization

use smaller task-specific models, local deployment to cut latency

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Weak mathematical and numerical reasoning in LLMs for planning

Prone to hallucination; needs verification and RAG

When Not To Use

Time-critical low-level control loops requiring sub-second response

Precise numerical optimization or trajectory planning without symbolic solvers

Failure Modes

Hallucinated plan leads to unsafe or infeasible robot actions

High latency causes missed control deadlines and mission failure
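
The latency failure mode above is commonly handled by putting a deadline on the planner call and falling back to a conservative action when it is missed. The sketch below is one illustrative way to do that with Python's standard library; `SAFE_ACTION`, the planner functions, and the deadline values are hypothetical and not taken from the survey.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical conservative fallback; a real system would define this per robot.
SAFE_ACTION = {"robot": "all", "skill": "hold_position"}

# Several workers so one stalled planner call does not block the next request.
_executor = ThreadPoolExecutor(max_workers=4)

def plan_with_deadline(planner, observation, deadline_s):
    """Return the planner's result, or SAFE_ACTION if it misses the deadline."""
    future = _executor.submit(planner, observation)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # cancel() cannot stop a call that is already running; the late result is simply ignored.
        future.cancel()
        return SAFE_ACTION

def fast_planner(observation):
    # Placeholder for a quick local model or cached plan.
    return {"robot": "rover_1", "skill": "navigate", "target": observation}

def slow_planner(observation):
    # Placeholder for a remote LLM call that takes too long.
    time.sleep(0.5)
    return {"robot": "rover_1", "skill": "navigate", "target": observation}

on_time = plan_with_deadline(fast_planner, "shelf_A", deadline_s=0.5)
too_late = plan_with_deadline(slow_planner, "shelf_A", deadline_s=0.05)
```

A timeout wrapper like this only bounds how long the caller waits. It does not make the planner faster, so time-critical control loops still belong in a conventional controller.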

Core Entities

Models

GPT-4, GPT-3.5 Turbo, Llama 3.1, Claude (Anthropic), DeepSeek-R1, Qwen-2.5, PaLI, CLIP

Metrics

task success rate, token efficiency, latency (s/step), precision/robustness in manipulation

Datasets

MultiPlan, BEHAVIOR-1K, ALFRED

Benchmarks

RoCoBench, BOLAA, COHERENT-Benchmark, RoCoBench (human-robot manipulation)

Context Entities

Models

SmolVLM, Moondream 2B, PaliGemma-2 3B, Qwen2-VL 2B

Metrics

real-world deployment success, communication steps, replanning frequency

Datasets

Multi-robot simulation scenarios (BoxNet, BoxLift, warehouse); task-specific synthetic datasets

Benchmarks

Task-specific evaluation suites used by surveyed papers
