Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
LLMs can speed up multi-robot coordination and simplify human instructions, but current limitations (math errors, hallucinations, latency) mean companies should pilot hybrid systems that pair LLMs for planning with verified controllers for execution.
Summary TLDR
This survey reviews how large language models (LLMs) are being applied to multi-robot systems (MRS). It organizes work into four levels: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention. The paper catalogs communication architectures (centralized, decentralized, hybrid), multimodal extensions (VLMs, VLAs), common simulators and benchmarks (AI2-THOR, PyBullet, RoCoBench, BOLAA, COHERENT), and practical challenges: weak mathematical reasoning, hallucination, latency, multi-modal fusion, and sparse standardized benchmarks. It ends with concrete opportunities: fine-tuning/LoRA, RAG, lightweight task-specific models, and richer multi-modal
Problem Statement
Integrating LLMs into real multi-robot teams promises easier instruction, dynamic task allocation, and richer human-robot interaction. However, MRS impose constraints that current LLM methods handle poorly: coordination, real-time behavior, heterogeneous robot bodies, and field deployment. The main obstacles are reasoning gaps, hallucination, latency, and weak benchmarks.
Main Contribution
First focused survey of LLM use specifically for multi-robot systems (MRS).
A clear taxonomy: high-level task allocation, mid-level motion planning, low-level action generation, and human intervention.
Review of communication architectures: centralized, decentralized, and hybrid (e.g., CMAS, DMAS, HMAS).
Summary of simulators and benchmarks used (AI2-THOR, PyBullet, Habitat-MAS, RoCoBench, BOLAA, COHERENT).
Catalog of practical challenges and concrete research directions (fine-tuning, RAG, lightweight models, multimodal VLAs).
Key Findings
LLMs are being used at four operational levels in MRS: task allocation, motion planning, action generation, and human-in-the-loop.
LLMs fail frequently on mathematical and logical reasoning tasks; performance can drop markedly when the clauses of a problem are changed.
Server-based LLM inference can be too slow for real-time multi-robot loops.
Communication architecture affects success and cost: hybrid HMAS-2 had higher success on complex tasks, while centralized CMAS was token-efficient in small teams.
Retrieval-augmented generation (RAG) and fine-tuning (e.g., LoRA) are effective practical levers to reduce hallucination and improve domain fit.
Who Should Care
What To Try In 7 Days
Run a proof-of-concept: use an LLM for high-level task allocation and a traditional planner for low-level control in simulation.
Measure latency and token costs with centralized vs hybrid communication on a small team (3–6 robots).
Test LoRA fine-tuning on a small domain corpus and compare hallucination rates with/without RAG retrieval.
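A minimal sketch of how the first two steps could be wired together in a proof-of-concept harness. Everything here is an assumption made for illustration: `fake_llm_allocate` is a hypothetical stand-in for a real model call, the robot and task names are invented, and the token count is a crude whitespace split rather than a real tokenizer.

```python
import time
from typing import Callable, Dict, List

Allocation = Dict[str, str]  # task name -> robot name


def fake_llm_allocate(prompt: str) -> Allocation:
    """Hypothetical stand-in for an LLM call (replace with a real client).

    It deliberately returns one assignment to a robot that does not exist,
    to mimic a hallucinated output.
    """
    return {"scan_aisle": "r1", "move_pallet": "r2", "charge": "r9"}


def verify(allocation: Allocation, robots: List[str], tasks: List[str]) -> Allocation:
    """Keep only assignments that name a known task and a known robot."""
    return {t: r for t, r in allocation.items() if t in tasks and r in robots}


def run_trial(llm: Callable[[str], Allocation], robots: List[str], tasks: List[str]) -> dict:
    prompt = (
        f"Robots: {', '.join(robots)}. Tasks: {', '.join(tasks)}. "
        "Assign each task to one robot."
    )
    start = time.perf_counter()
    proposal = llm(prompt)
    latency_s = time.perf_counter() - start

    accepted = verify(proposal, robots, tasks)
    return {
        "latency_s": latency_s,
        # Assumption: whitespace word count as a rough proxy for tokens.
        "prompt_tokens_approx": len(prompt.split()),
        "accepted": accepted,  # safe to hand to the traditional planner
        "rejected": sorted(set(proposal) - set(accepted)),
    }


result = run_trial(
    fake_llm_allocate,
    robots=["r1", "r2", "r3"],
    tasks=["scan_aisle", "move_pallet", "charge"],
)
print(result["accepted"], result["rejected"])
```

Only the accepted assignments would be passed on to the low-level planner; rejected ones could trigger a re-prompt or a fallback allocator. Running the same trial with different prompt layouts (one central prompt versus per-robot prompts) gives a first comparison of latency and approximate token cost.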
Agent Features
Memory
- short-term session memory
- retrospective/long-term memory
Planning
- task allocation
- motion planning
- action generation
- human-in-the-loop
Tool Use
- LoRA
- RAG (retrieval-augmented generation)
- VLMs (vision-language models)
- VLAs (vision-language-action models)
Frameworks
- EMOS
- RoCo
- LLM-Flock
- DART-LLM
- GenSwarm
- BOLAA
Is Agentic
true
Architectures
- centralized
- decentralized
- hybrid
- hierarchical
Collaboration
- inter-agent dialogue
- central planner coordination
- iterative proposal-feedback loops
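The iterative proposal-feedback pattern can be illustrated with a toy loop. This is a sketch under stated assumptions, not the method of any surveyed framework: the planner and the per-robot feedback are plain deterministic functions standing in for LLM calls, and the robot names and capability table are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical capability table: robot -> tasks it can perform.
CAPABILITIES: Dict[str, Set[str]] = {
    "arm1": {"pick"},
    "rover1": {"carry", "scan"},
}


def propose(tasks: List[str], banned: Dict[str, Set[str]]) -> Dict[str, str]:
    """Stand-in for a central planner: give each task to the first robot
    that has not already refused it."""
    plan: Dict[str, str] = {}
    for task in tasks:
        for robot in CAPABILITIES:
            if robot not in banned.get(task, set()):
                plan[task] = robot
                break
    return plan


def feedback(plan: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stand-in for per-robot feedback: each robot objects to tasks it cannot do."""
    return [(task, robot) for task, robot in plan.items()
            if task not in CAPABILITIES[robot]]


def negotiate(tasks: List[str], max_rounds: int = 5) -> Tuple[Dict[str, str], int]:
    """Repeat propose -> feedback until no robot objects or rounds run out."""
    banned: Dict[str, Set[str]] = {}
    for round_no in range(1, max_rounds + 1):
        plan = propose(tasks, banned)
        objections = feedback(plan)
        if not objections and set(plan) == set(tasks):
            return plan, round_no
        for task, robot in objections:
            banned.setdefault(task, set()).add(robot)
    raise RuntimeError("no agreed plan within max_rounds")


plan, rounds = negotiate(["pick", "carry", "scan"])
print(plan, rounds)
```

Each extra round corresponds to additional model calls in a real system, so the round count is one way to observe the trade-off between communication steps and plan quality.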
Optimization Features
Token Efficiency
- centralized CMAS reported as token-efficient in small teams
- prompt size reduction with RAG
Infra Optimization
- onboard inference hardware for remote/field robots
Model Optimization
- LoRA
- model distillation
System Optimization
- hybrid architectures to trade token cost against the number of steps
Training Optimization
- synthetic dataset generation
- task-specific fine-tuning
Inference Optimization
- use smaller task-specific models
- local deployment to cut latency
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Weak mathematical and numerical reasoning in LLMs for planning
- Prone to hallucination; outputs need verification, and RAG can help ground them
- High and variable latency for server-based models
- Multi-modal fusion and VLA grounding remain immature for diverse sensors
- Benchmarks skewed to indoor/household tasks; outdoor/unstructured tests scarce
- Field deployment constrained by connectivity and onboard compute
When Not To Use
- Time-critical low-level control loops requiring sub-second response
- Precise numerical optimization or trajectory planning without symbolic solvers
- Bandwidth-constrained remote deployments that cannot host local models
Failure Modes
- Hallucinated plan leads to unsafe or infeasible robot actions
- High latency causes missed control deadlines and mission failure
- Inconsistent inter-agent messages create conflicting assignments
- Overfitting to simulated scenarios and poor transfer to real robots
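Two of these failure modes, hallucinated plans and conflicting assignments, can be caught before execution with a deterministic check. The sketch below is one possible form of such a check; the `Step` type, the capability table, and the robot and task names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Step:
    robot: str
    task: str


def validate_plan(
    steps: List[Step],
    capabilities: Dict[str, Set[str]],  # robot -> equipment it carries
    required: Dict[str, str],           # task -> equipment the task needs
) -> List[str]:
    """Return human-readable problems found in an LLM-proposed plan."""
    errors: List[str] = []
    owner: Dict[str, str] = {}  # first robot seen for each task
    for step in steps:
        if step.robot not in capabilities:
            errors.append(f"unknown robot: {step.robot}")
            continue
        if step.task not in required:
            errors.append(f"unknown task: {step.task}")
            continue
        need = required[step.task]
        if need not in capabilities[step.robot]:
            errors.append(f"{step.robot} lacks '{need}' for {step.task}")
        if step.task in owner and owner[step.task] != step.robot:
            errors.append(
                f"conflict: {step.task} assigned to {owner[step.task]} and {step.robot}"
            )
        owner.setdefault(step.task, step.robot)
    return errors


capabilities = {"uav1": {"camera"}, "ugv1": {"gripper", "camera"}}
required = {"inspect_roof": "camera", "lift_box": "gripper"}
proposed = [
    Step("uav1", "inspect_roof"),
    Step("uav1", "lift_box"),      # infeasible: no gripper
    Step("ugv1", "lift_box"),      # conflicts with the previous assignment
    Step("uav7", "inspect_roof"),  # robot does not exist
]
errors = validate_plan(proposed, capabilities, required)
for message in errors:
    print(message)
```

A plan with a non-empty error list would be sent back for replanning rather than executed. Latency and sim-to-real failures are not addressed by a check like this and need separate safeguards.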
Core Entities
Models
- GPT-4
- GPT-3.5 Turbo
- Llama 3.1
- Claude (Anthropic)
- DeepSeek-R1
- Qwen-2.5
- PaLI
- CLIP
Metrics
- task success rate
- token efficiency
- latency (s/step)
- precision/robustness in manipulation
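As a rough illustration, the first three metrics could be computed from per-episode logs as follows. The `Episode` record and its fields are assumptions, and defining token efficiency as tokens spent per successful episode is only one possible reading; the surveyed papers may define it differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    success: bool       # did the team complete the task?
    steps: int          # planning/execution steps taken
    wall_time_s: float  # total wall-clock time for the episode
    tokens: int         # total LLM tokens consumed


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate success rate, latency per step, and tokens per success."""
    if not episodes:
        raise ValueError("need at least one episode")
    total_steps = sum(e.steps for e in episodes)
    if total_steps == 0:
        raise ValueError("episodes contain no steps")
    successes = sum(1 for e in episodes if e.success)
    total_tokens = sum(e.tokens for e in episodes)
    return {
        "success_rate": successes / len(episodes),
        "latency_s_per_step": sum(e.wall_time_s for e in episodes) / total_steps,
        # Assumed definition: all tokens spent divided by successful episodes.
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
    }


# Invented numbers, for illustration only.
log = [
    Episode(success=True, steps=10, wall_time_s=20.0, tokens=4000),
    Episode(success=False, steps=20, wall_time_s=70.0, tokens=9000),
    Episode(success=True, steps=10, wall_time_s=30.0, tokens=5000),
]
summary = summarize(log)
print(summary)
```

Counting the tokens of failed episodes against the successful ones penalizes architectures that burn tokens on runs that never finish, which is a design choice rather than a standard.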
Datasets
- MultiPlan
- BEHAVIOR-1K
- ALFRED
Benchmarks
- RoCoBench
- BOLAA
- COHERENT-Benchmark
- RoCoBench (human-robot manipulation)
Context Entities
Models
- SmolVLM
- Moondream 2B
- PaliGemma-2 3B
- Qwen2-VL 2B
Metrics
- real-world deployment success
- communication steps
- replanning frequency
Datasets
- Multi-robot simulation scenarios (BoxNet, BoxLift, warehouse)
- Task-specific synthetic datasets
Benchmarks
- Task-specific evaluation suites used by surveyed papers

