How large language models (LLMs) are being used to coordinate, plan, and control teams of robots

February 6, 2025 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

4

Authors

Peihan Li, Zijian An, Shams Abrar, Lifeng Zhou

Why It Matters For Business

LLMs can speed up multi-robot coordination and simplify human instructions, but current limitations (math errors, hallucinations, latency) mean companies should pilot hybrid systems that pair LLMs for planning with verified controllers for execution.

Summary TLDR

This survey reviews how large language models (LLMs) are being applied to multi-robot systems (MRS). It organizes work into four levels: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention. The paper catalogs communication architectures (centralized, decentralized, hybrid), multimodal extensions (VLMs, VLAs), common simulators and benchmarks (AI2-THOR, PyBullet, RoCoBench, BOLAA, COHERENT), and practical challenges: weak mathematical reasoning, hallucination, latency, multi-modal fusion, and sparse standardized benchmarks. It ends with concrete opportunities: fine-tuning/LoRA, RAG, lightweight task-specific models, and richer multi-modal

Problem Statement

Integrating LLMs into real multi-robot teams promises easier instruction, dynamic task allocation, and richer human-robot interaction. However, MRS impose unique constraints (coordination, real-time behavior, heterogeneous robot bodies, and field deployment) that current LLM methods struggle with because of reasoning gaps, hallucination, latency, and weak benchmarks.

Main Contribution

First focused survey of LLM use specifically for multi-robot systems (MRS).

A clear taxonomy: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention.

Review of communication architectures: centralized, decentralized, and hybrid (e.g., CMAS, DMAS, HMAS).

Summary of simulators and benchmarks used (AI2-THOR, PyBullet, Habitat-MAS, RoCoBench, BOLAA, COHERENT).

Catalog of practical challenges and concrete research directions (fine-tuning, RAG, lightweight models, multimodal VLAs).

Key Findings

LLMs are being used at four operational levels in MRS: task allocation, motion planning, action generation, and human-in-the-loop.

LLMs fail frequently on mathematical and logical reasoning tasks; performance can drop markedly when problem clauses change.

Numbers: up to 65% performance drop reported

Server-based LLM inference can be too slow for real-time multi-robot loops.

Numbers: 15–30 seconds per step reported using GPT-4
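To put the reported per-step figure in context, a back-of-the-envelope calculation may help. In the sketch below, only the 15–30 s per-step range is taken from the text above; the 20-step mission length and the 10 Hz control rate are illustrative assumptions, not numbers from the survey.

```python
# Back-of-the-envelope latency budget. Only the 15-30 s/step range comes from
# the survey summary; mission length and control rate are assumed values.

def mission_latency_s(steps: int, per_step_s: float) -> float:
    """Total time spent waiting on LLM inference over a mission."""
    return steps * per_step_s


def control_ticks_per_call(per_step_s: float, control_hz: float) -> int:
    """Control-loop ticks that elapse while one LLM call is in flight."""
    return int(per_step_s * control_hz)


STEPS = 20          # assumed mission length (hypothetical)
CONTROL_HZ = 10.0   # assumed low-level control rate (hypothetical)

low = mission_latency_s(STEPS, 15.0)
high = mission_latency_s(STEPS, 30.0)
print(f"LLM wait per mission: {low / 60:.0f}-{high / 60:.0f} min")
print(f"Control ticks per LLM call: {control_ticks_per_call(15.0, CONTROL_HZ)}"
      f"-{control_ticks_per_call(30.0, CONTROL_HZ)}")
```

Under these assumptions, a single planning call spans hundreds of control ticks, which is why the low-level loop cannot wait on it.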

Communication architecture affects success and cost: hybrid HMAS-2 had higher success on complex tasks, while centralized CMAS was token-efficient in small teams.

Retrieval-augmented generation (RAG) and fine-tuning (e.g., LoRA) are effective practical levers to reduce hallucination and improve domain fit.

What To Try In 7 Days

Run a proof-of-concept: use an LLM for high-level task allocation and a traditional planner for low-level control in simulation.
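One way such a proof-of-concept could be wired is sketched below. The LLM is a stub (`fake_llm`), and the validation rule and the round-robin fallback are assumptions chosen for illustration; the survey does not prescribe this design. The point is only that an LLM proposal is checked before anything reaches the low-level planner.

```python
from typing import Callable, Dict, List

Allocation = Dict[str, str]  # task name -> robot name


def round_robin_allocate(tasks: List[str], robots: List[str]) -> Allocation:
    """Deterministic fallback: hand out tasks to robots in turn."""
    return {task: robots[i % len(robots)] for i, task in enumerate(tasks)}


def is_valid(alloc: Allocation, tasks: List[str], robots: List[str]) -> bool:
    """Accept only plans that cover exactly the given tasks with known robots."""
    return set(alloc) == set(tasks) and all(r in robots for r in alloc.values())


def plan(tasks: List[str], robots: List[str],
         llm_allocate: Callable[[List[str], List[str]], Allocation]) -> Allocation:
    """Use the LLM proposal if it passes validation, else the fallback."""
    try:
        proposal = llm_allocate(tasks, robots)
    except Exception:
        proposal = {}  # treat any LLM failure as "no proposal"
    if is_valid(proposal, tasks, robots):
        return proposal
    return round_robin_allocate(tasks, robots)


def fake_llm(tasks: List[str], robots: List[str]) -> Allocation:
    """Hypothetical stub standing in for a real LLM call; it invents robot r9."""
    return {"scan_shelf": "r1", "move_box": "r9"}


print(plan(["scan_shelf", "move_box"], ["r1", "r2"], fake_llm))
```

Because the stub names a robot that does not exist, the proposal is rejected and the deterministic allocation is used instead.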

Measure latency and token costs with centralized vs hybrid communication on a small team (3–6 robots).
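A small metering wrapper is one way to collect these numbers. The sketch below is illustrative only: `echo_llm` is a stub, whitespace word counts stand in for a real tokenizer, and the hybrid layout (one central draft plus one feedback call per robot) is an assumed simplification rather than the exact protocol of any surveyed system.

```python
import time
from typing import Callable, List


class MeteredLLM:
    """Wraps any prompt -> reply function and records calls, tokens, latency."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.calls = 0
        self.tokens = 0
        self.seconds = 0.0

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        reply = self.llm(prompt)
        self.seconds += time.perf_counter() - start
        self.calls += 1
        # Whitespace word count is a crude proxy for a real tokenizer.
        self.tokens += len(prompt.split()) + len(reply.split())
        return reply


def centralized_round(llm: Callable[[str], str], states: List[str]) -> None:
    """One planner sees every robot state in a single prompt."""
    llm("Plan for: " + " ; ".join(states))


def hybrid_round(llm: Callable[[str], str], states: List[str]) -> None:
    """Assumed layout: one central draft, then one feedback call per robot."""
    draft = llm("Plan for: " + " ; ".join(states))
    for state in states:
        llm(f"Robot state: {state} . Draft: {draft} . Feedback?")


def echo_llm(prompt: str) -> str:
    """Hypothetical stub; swap in a real client to get meaningful latency."""
    return "ok"


states = ["r1 idle", "r2 idle", "r3 charging"]
for name, round_fn in [("centralized", centralized_round), ("hybrid", hybrid_round)]:
    meter = MeteredLLM(echo_llm)
    round_fn(meter, states)
    print(name, "calls:", meter.calls, "tokens:", meter.tokens)
```

With a real model behind the wrapper, the same loop would report wall-clock latency in `meter.seconds` for each architecture.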

Test LoRA fine-tuning on a small domain corpus and compare hallucination rates with/without RAG retrieval.
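Comparing hallucination rates requires an operational definition first. The sketch below uses one possible definition, which is an assumption of this example: a plan step counts as hallucinated if it names a robot or object that is not in the known world model. The two plans are hand-written placeholders to exercise the function, not measured results.

```python
from typing import Iterable, Set, Tuple

Step = Tuple[str, str, str]  # (robot, action, target object)


def hallucination_rate(plan: Iterable[Step], robots: Set[str], objects: Set[str]) -> float:
    """Fraction of steps that reference an unknown robot or object."""
    steps = list(plan)
    if not steps:
        return 0.0
    bad = sum(1 for robot, _action, obj in steps
              if robot not in robots or obj not in objects)
    return bad / len(steps)


ROBOTS = {"r1", "r2"}
OBJECTS = {"box_a", "shelf_1"}

# Placeholder plans for illustration only (not experimental data).
plan_without_rag = [("r1", "pick", "box_a"), ("r3", "scan", "shelf_1")]
plan_with_rag = [("r1", "pick", "box_a"), ("r2", "scan", "shelf_1")]

print("without RAG:", hallucination_rate(plan_without_rag, ROBOTS, OBJECTS))
print("with RAG:", hallucination_rate(plan_with_rag, ROBOTS, OBJECTS))
```

Running the same metric over many prompts for each configuration gives a like-for-like comparison.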

Agent Features

Memory

  • short-term session memory
  • retrospective/long-term memory

Planning

  • task allocation
  • motion planning
  • action generation
  • human-in-the-loop

Tool Use

  • LoRA
  • RAG (retrieval-augmented generation)
  • VLMs (vision-language models)
  • VLAs (vision-language-action models)

Frameworks

  • EMOS
  • RoCo
  • LLM-Flock
  • DART-LLM
  • GenSwarm
  • BOLAA

Is Agentic

true

Architectures

  • centralized
  • decentralized
  • hybrid
  • hierarchical

Collaboration

  • inter-agent dialogue
  • central planner coordination
  • iterative proposal-feedback loops
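An iterative proposal-feedback loop, the last collaboration pattern listed above, can be sketched as follows. This is a minimal illustration under stated assumptions: the proposer and critics are plain functions standing in for LLM-backed agents, and the names `negotiate`, `propose`, `arm_critic`, and `rover_critic` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Plan = Dict[str, str]  # robot -> task


def negotiate(propose: Callable[[List[str]], Plan],
              critics: List[Callable[[Plan], Optional[str]]],
              max_rounds: int = 3) -> Optional[Plan]:
    """Re-propose until no critic objects, or give up after max_rounds."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        plan = propose(feedback)
        feedback = []
        for critic in critics:
            message = critic(plan)
            if message is not None:
                feedback.append(message)
        if not feedback:
            return plan
    return None  # no agreement; caller should fall back to a safe default


def propose(feedback: List[str]) -> Plan:
    """Stub proposer: swaps the tasks once the rover has objected."""
    if any("rover" in msg for msg in feedback):
        return {"arm": "lift_box", "rover": "carry_box"}
    return {"arm": "carry_box", "rover": "lift_box"}


def arm_critic(plan: Plan) -> Optional[str]:
    return "arm is fixed in place" if plan.get("arm") == "carry_box" else None


def rover_critic(plan: Plan) -> Optional[str]:
    return "rover has no gripper" if plan.get("rover") == "lift_box" else None


print(negotiate(propose, [arm_critic, rover_critic]))
```

Bounding the number of rounds matters in practice, since every round costs additional tokens and latency.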

Optimization Features

Token Efficiency

  • centralized CMAS is token-efficient in small teams (reported)
  • prompt size reduction with RAG

Infra Optimization

  • onboard inference hardware for remote/field robots

Model Optimization

  • LoRA
  • model distillation

System Optimization

  • hybrid architectures to trade token cost against communication steps

Training Optimization

  • synthetic dataset generation
  • task-specific fine-tuning

Inference Optimization

  • use smaller task-specific models
  • local deployment to cut latency

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Weak mathematical and numerical reasoning in LLMs for planning
  • Prone to hallucination; needs verification and RAG
  • High and variable latency for server-based models
  • Multi-modal fusion and VLA grounding remain immature for diverse sensors
  • Benchmarks skewed to indoor/household tasks; outdoor/unstructured tests scarce
  • Field deployment constrained by connectivity and onboard compute

When Not To Use

  • Time-critical low-level control loops requiring sub-second response
  • Precise numerical optimization or trajectory planning without symbolic solvers
  • Bandwidth-constrained remote deployments that cannot host local models
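When an LLM sits near a time-critical path anyway, one common mitigation is a deadline guard that returns a safe default if the call overruns. The sketch below is an assumption-laden illustration: `slow_planner` simulates a remote call with `time.sleep`, and the budget values and the `"hold_position"` fallback are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def with_deadline(slow_call: Callable[[], T], budget_s: float, fallback: T) -> T:
    """Return slow_call() if it finishes within budget_s, else fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_call)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on a straggler; it keeps running in the background
        # and its result is discarded.
        pool.shutdown(wait=False)


def slow_planner() -> str:
    """Hypothetical stand-in for a remote LLM call."""
    time.sleep(0.2)
    return "llm_plan"


print(with_deadline(slow_planner, budget_s=0.05, fallback="hold_position"))
print(with_deadline(slow_planner, budget_s=1.0, fallback="hold_position"))
```

The guard bounds how long the caller waits; it does not cancel the underlying request, so a real deployment would also need request-level timeouts.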

Failure Modes

  • Hallucinated plan leads to unsafe or infeasible robot actions
  • High latency causes missed control deadlines and mission failure
  • Inconsistent inter-agent messages create conflicting assignments
  • Overfitting to simulated scenarios and poor transfer to real robots
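The conflicting-assignment failure mode above is cheap to detect before execution. The following sketch assumes each robot reports the single task it believes it owns; the function name and data shape are hypothetical and not taken from any surveyed framework.

```python
from collections import defaultdict
from typing import Dict, List


def find_conflicts(claims: Dict[str, str]) -> Dict[str, List[str]]:
    """Return tasks claimed by more than one robot.

    claims maps robot name -> task that robot believes it owns.
    """
    owners: Dict[str, List[str]] = defaultdict(list)
    for robot, task in sorted(claims.items()):  # sorted for stable output
        owners[task].append(robot)
    return {task: robots for task, robots in owners.items() if len(robots) > 1}


claims = {"r2": "dock", "r1": "dock", "r3": "scan"}
print(find_conflicts(claims))
```

A non-empty result can trigger a replanning round or a deterministic tie-break before any robot moves.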

Core Entities

Models

  • GPT-4
  • GPT-3.5 Turbo
  • Llama 3.1
  • Claude (Anthropic)
  • DeepSeek-R1
  • Qwen-2.5
  • PaLI
  • CLIP

Metrics

  • task success rate
  • token efficiency
  • latency (s/step)
  • precision/robustness in manipulation

Datasets

  • MultiPlan
  • BEHAVIOR-1K
  • ALFRED

Benchmarks

  • RoCoBench (human-robot manipulation)
  • BOLAA
  • COHERENT-Benchmark

Context Entities

Models

  • SmolVLM
  • Moondream 2B
  • PaliGemma-2 3B
  • Qwen2-VL 2B

Metrics

  • real-world deployment success
  • communication steps
  • replanning frequency

Datasets

  • Multi-robot simulation scenarios (BoxNet, BoxLift, warehouse)
  • Task-specific synthetic datasets

Benchmarks

  • Task-specific evaluation suites used by surveyed papers