Agent Orchestration Papers — Parsed & Scored for Practitioners

OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

0.50

0.60

0.45

76

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Key finding

A large, general-purpose LLM (GPT-4) achieves the highest overall OpenAGI scores in both zero-shot and few-shot settings.

Numbers: GPT-4 overall: 0.2378 (zero-shot) -> 0.5281 (few-shot)

A modular graph framework that lets multiple LLM agents collaborate, create agents, and supervise each other

0.30

0.60

0.40

50

Modular LLM agents let teams split complex workflows, add verifiers to reduce costly errors, and plug in APIs safely — but they add orchestration costs and governance requirements.

Key finding

Agents can be modeled as tuples (L, R, S, C, H) to standardize behavior and permissions.

BOLAA: orchestrating specialist LLM agents with a controller improves web navigation and reasoning on standard benchmarks

0.60

0.65

0.70

9

Splitting complex agent work into small, specialist LLMs coordinated by a controller can match or beat large single LLM agents and reduce compute cost by enabling smaller models to specialize.

Key finding

Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.

Numbers: gpt-3.5-turbo BOLAA reward=0.6567 vs ZS=0.5061 (Table 1)
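
The controller pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BOLAA's actual implementation: the agent functions (`search_agent`, `click_agent`), the keyword-based routing rule, and the `Controller` class are hypothetical stand-ins for LLM-backed components.

```python
from typing import Callable, Dict

# Hypothetical specialist agents; in a real system each would wrap an LLM call.
def search_agent(observation: str) -> str:
    return f"search[{observation}]"

def click_agent(observation: str) -> str:
    return f"click[{observation}]"

class Controller:
    """Routes each observation to one specialist agent (illustrative only)."""

    def __init__(self, agents: Dict[str, Callable[[str], str]]):
        self.agents = agents

    def select(self, observation: str) -> str:
        # Assumed routing rule: a page that lists results needs a click,
        # anything else needs a search query.
        return "click" if "results:" in observation else "search"

    def act(self, observation: str) -> str:
        return self.agents[self.select(observation)](observation)

controller = Controller({"search": search_agent, "click": click_agent})
```

Because each specialist only handles one action type, it can be backed by a smaller model than a single agent that has to handle everything.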

Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

0.40

0.80

0.60

7

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Key finding

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

Domain-specific AI agents collaborate to find cross-domain knowledge

0.30

0.50

0.40

7

Orchestrated domain-specific agents can raise answer accuracy for cross-field queries, trading speed for higher-quality, context-aware results.

Key finding

Agents were seeded with domain literature to create domain-specific expertise.

Numbers: ≈1000 papers per agent (Section 2.1)

Generate editable BIM models from plain language by orchestrating LLM agents that write modeling code

0.60

6

Text2BIM lets designers describe early-stage buildings in plain language and get editable BIM models, reducing manual modeling effort and speeding concept-to-BIM workflows while preserving the ability to refine results in standard BIM tools.

Key finding

The framework produced editable IFC/BIM models for 25 diverse prompts, generating 534 models in total.

Numbers: 534 IFC models generated (25 prompts × 3 LLMs × 3 repeats incl. intermediate runs)

Survey: hybrid LLM architectures (RAG, agents, verifiers) for complex question answering

0.70

0.50

0.80

6

For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers and human feedback to get accuracy, auditability and privacy—this reduces risk and improves trust but raises cost and latency.

Key finding

Best-practice stacks now couple agentic controllers, retrieval grounding, and verifier or process reward model (PRM) loops to answer complex questions.
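
A retrieve, generate, verify loop of the kind the survey describes might look like the sketch below. It is a generic illustration, not code from the survey: `answer_with_verifier` and the three stub functions are hypothetical, and real systems would replace the stubs with a retriever, an LLM, and a verifier model.

```python
from typing import Callable, List, Optional

def answer_with_verifier(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str, List[str]], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first draft the verifier accepts, or None if all fail."""
    for attempt in range(max_attempts):
        passages = retrieve(question, attempt)
        draft = generate(question, passages)
        if verify(question, draft, passages):
            return draft
    return None

# Hypothetical stubs standing in for real components.
def stub_retrieve(question: str, attempt: int) -> List[str]:
    corpus = ["Paris is in France.", "The capital of France is Paris."]
    return corpus[: attempt + 1]  # widen retrieval on each retry

def stub_generate(question: str, passages: List[str]) -> str:
    return passages[-1]  # pretend the model echoes the last passage

def stub_verify(question: str, draft: str, passages: List[str]) -> bool:
    return "capital" in draft  # toy check that the draft addresses the question
```

Each retry adds cost and latency, which is the trade-off the summary above points out.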

A public index of 67 deployed agentic AI systems that exposes capability documentation but sparse safety disclosure.

0.60

0.50

0.70

3

Agentic systems are moving into products; you need to verify safety practices before integrating them because public capability docs are common but safety disclosures are rare.

Key finding

The index catalogs 67 deployed agentic AI systems.

Numbers: n = 67

Orchestrated Distributed Intelligence (ODI): orchestrate multiple AI agents with humans to turn systems of record into systems of action

0.40

0.65

0.70

3

Orchestrated agentic AI turns data repositories into real-time decision systems, boosting productivity, reducing manual work, and enabling strategic agility when paired with governance and change management.

Key finding

Many US economic processes remain heavily manual, creating room for AI-driven automation.

Numbers: Nearly 50% of US GDP involves processes with up to 90% manual labor

Practical survey and roadmap for four agent interoperability protocols (MCP, ACP, A2A, ANP)

0.70

0.40

0.70

3

Standardizing agent interfaces reduces engineering cost, improves security, and enables reusable agent services across teams and vendors.

Key finding

Four distinct protocols target different interoperability layers.

Numbers: 4 protocols compared (MCP, ACP, A2A, ANP)

Croto: Orchestrating multiple LLM agent teams to jointly propose, prune, and synthesize better code and stories

0.35

0.60

0.45

3

Croto shows you can run multiple independent LLM teams, share and merge their intermediate outputs, and get measurably better code or narrative drafts—useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.

Key finding

Croto raises overall software quality over a strong multi-agent baseline (ChatDev).

Numbers: Quality: Croto 0.840 vs ChatDev 0.779

LLaMAC: an actor-critic wrapper that coordinates many LLM-based agents with a TripletCritic and token‑efficient feedback

0.60

0.70

0.60

3

If you run many LLM-driven agents, LLaMAC lowers LLM calls and increases task success by coordinating agents through a centralized critic plus selective actor feedback.

Key finding

LLaMAC was tested on multi-agent resource allocation with up to 50 agents and maintained stable learning.

Numbers: evaluations with 3, 5, 10, 20, and 50 agents

An Internet-like platform that links diverse LLM agents into dynamic teams and chat groups

0.60

0.55

0.40

3

IoA lets you combine existing specialized agents into coordinated teams to raise task success without retraining models; expect better QA and tool use at the cost of coordination tokens and some extra infrastructure.

Key finding

IoA achieves substantially higher win rates on open-ended instructions when it orchestrates third-party agents.

Numbers: Win rate vs AutoGPT: 76.5%; vs Open Interpreter: 63.4%

Train the controller to shorten the critical execution path so parallel agent teams run much faster without losing accuracy

0.60

0.70

2

When you run multiple LLM-based agents in parallel, overall response time depends on the slowest chain of dependent steps (the critical path). Training the orchestration policy to minimize that path substantially reduces latency without sacrificing accuracy, which helps interactive products and time-sensitive workflows.

Key finding

LAMaS reduced critical-path length substantially compared to MaAS on three benchmarks.

Numbers: CP len reduced by 38.0% (GSM8K), 42.4% (HumanEval), 46.1% (MATH)
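
To make the critical-path idea concrete, the sketch below computes the longest dependency chain in a small agent workflow. It is a generic longest-path calculation on a DAG, not the LAMaS training method; the step names and durations are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List

def critical_path_length(durations: Dict[str, float],
                         deps: Dict[str, List[str]]) -> float:
    """Longest total duration along any dependency chain in a DAG."""

    @lru_cache(maxsize=None)
    def finish(step: str) -> float:
        # A step can start only after its slowest prerequisite finishes.
        start = max((finish(d) for d in deps.get(step, [])), default=0.0)
        return start + durations[step]

    return max(finish(step) for step in durations)

# Hypothetical workflow: two solver agents run in parallel, then a merger runs.
durations = {"plan": 1.0, "solve_a": 4.0, "solve_b": 2.0, "merge": 1.0}
deps = {"solve_a": ["plan"], "solve_b": ["plan"], "merge": ["solve_a", "solve_b"]}
print(critical_path_length(durations, deps))  # 6.0
```

In this example, speeding up `solve_b` changes nothing, while speeding up `solve_a` shortens the end-to-end time; that is why optimizing the critical path rather than total work reduces latency.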

A unified protocol and toolkit (Exgentic) to fairly evaluate general-purpose agents across diverse benchmarks

0.60

0.70

2

A single evaluation protocol reduces integration cost, reveals whether you should invest in a better LLM or in agent engineering, and helps pick cost-performance tradeoffs for production.

Key finding

Model choice explains far more performance variance than agent architecture.

Numbers: Model choice explains 28.2% vs agent architecture 0.6% of variance

Use a learned manager to steer LLM agents by changing who sees what — raising network cooperation without rewiring links

0.40

0.60

2

Adaptive control of who sees what is a low-cost governance lever: you can raise coordination among autonomous agents without changing incentives or network wiring, cutting engineering and policy friction.

Key finding

A learned RL manager drives full network cooperation in the simulated Prisoner's Dilemma (PD) runs.

Numbers: Reach 100% mutual cooperation (CC) by timestep 10 on average (RL method)
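
One way to measure mutual cooperation on a network is the share of linked pairs in which both agents cooperate. The sketch below assumes that per-link definition, which may differ from the paper's exact metric; the function name, agent labels, and edges are hypothetical.

```python
from typing import Dict, List, Tuple

def mutual_cooperation_rate(actions: Dict[str, str],
                            edges: List[Tuple[str, str]]) -> float:
    """Fraction of linked pairs in which both agents chose 'C' (cooperate)."""
    if not edges:
        return 0.0
    cc = sum(1 for a, b in edges if actions[a] == "C" and actions[b] == "C")
    return cc / len(edges)

# Hypothetical three-agent line network: a - b - c
edges = [("a", "b"), ("b", "c")]
print(mutual_cooperation_rate({"a": "C", "b": "C", "c": "D"}, edges))  # 0.5
print(mutual_cooperation_rate({"a": "C", "b": "C", "c": "C"}, edges))  # 1.0
```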

A multi-layer Agentic AI architecture for faster, adaptive emergency response over next‑gen networks

0.40

0.50

0.60

2

Putting autonomous agents at the edge can cut response times by more than half and improve decision accuracy in emergency services, but it raises infrastructure and governance costs for compute, bandwidth, and accountability.

Key finding

Agentic AI cut average response time from ~8.6 minutes to 3.2 minutes compared to non-agentic baselines.

Numbers: Response time: 8.6m -> 3.2m (Table I)

When big LLMs get better, multi-agent setups lose much of their edge — use targeted upgrades and hybrid routing to save cost.

0.60

0.50

0.70

2

Multi-agent systems (MAS) increase accuracy over single-agent systems (SAS) only for very hard tasks but multiply deployment cost; hybrid routing or cascading lets you save API and latency costs while keeping or improving accuracy.

Key finding

MAS accuracy advantage shrinks as underlying LLMs improve.

Numbers: MetaGPT-HumanEval: ChatGPT SAS→67% vs MAS→87.7% (10.7% gain); Gemini-2.0: SAS→90.2% vs MAS→93.2% (3.0% gain).
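
A cascade of the kind mentioned above can be sketched as follows: try the cheap single-agent system first and escalate to the multi-agent system only when confidence is low. This is an illustrative sketch, not the paper's routing method; the `cascade` function, the confidence threshold, and the stub systems are all hypothetical.

```python
from typing import Callable, Tuple

def cascade(task: str,
            single_agent: Callable[[str], Tuple[str, float]],
            multi_agent: Callable[[str], str],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap single agent first; escalate only on low confidence."""
    answer, confidence = single_agent(task)
    if confidence >= threshold:
        return answer, "single"
    return multi_agent(task), "multi"

# Hypothetical stubs standing in for real LLM-backed systems.
def stub_single(task: str) -> Tuple[str, float]:
    # Toy confidence: short tasks are treated as easy.
    confidence = 0.9 if len(task) < 20 else 0.4
    return f"draft:{task}", confidence

def stub_multi(task: str) -> str:
    return f"reviewed:{task}"
```

The multi-agent system is invoked only for the tasks the single agent is unsure about, so its extra cost is paid only where it is most likely to help.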

A drag-and-drop, no-code UI + APIs for building, testing, profiling, and exporting multi-agent workflows

0.30

0.60

2

AutoGen Studio shortens the gap between idea and working multi-agent prototype. Teams can visually assemble agents, track costs and tool failures, and export workflows to run as APIs or Docker containers. This accelerates experimentation and handoff to engineers while keeping reproducible component specs.

Key finding

Wide early adoption and an active feedback loop.

Numbers: 200K+ installs in 5 months; >135 GitHub issues

An LLM conductor that chains music models and keeps a shared music state for iterative loop creation

0.50

0.60

0.40

2

Loop Copilot shows how an LLM can orchestrate specialized models to speed up prototyping and ideation in music; apply it to demo generation, rapid iteration, and studio assistants while planning for tighter DAW integration and finer controls.

Key finding

Participants found Loop Copilot usable.

Numbers: SUS mean = 75.31 ± 15.32
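
For context on the scale, the sketch below applies the standard System Usability Scale scoring rule to one participant's answers: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0-100 score. The function name and the example responses are hypothetical, not data from the study.

```python
from typing import Sequence

def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS answers (each 1-5) on a 0-100 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... are positively worded; items 2, 4, 6, ... negatively.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5

print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```

A study-level figure such as the mean reported above is the average of these per-participant scores.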

AgentOps: a six-stage automation pipeline to observe, analyze, and auto-optimize multi-agent AI

0.60

0.50

0.60

1

AgentOps reduces costly downtime and manual triage by automating detection, root-cause analysis, and runtime fixes for multi-agent LLM systems.

Key finding

Few organizations run dedicated observability for agentic AI.

Numbers: 8% of organizations (survey refs [2],[3])

Tippy: a production-ready multi-agent system that automates drug discovery lab workflows

0.70

0.60

1

Tippy shows how to turn multi-agent lab automation from a concept into a deployable platform, enabling more automated design-make-test-analyze (DMTA) cycles, reproducible deployments, and scalable instrument orchestration.

Key finding

Tippy uses five specialized agents (Supervisor, Molecule, Lab, Analysis, Report) plus a Safety Guardrail.

Numbers: 5 specialized agents

Practical blueprint for making enterprise APIs 'agent-ready' for autonomous AI agents

0.50

0.60

1

If you plan to let AI agents use your APIs, you must redesign endpoints, headers, and governance now to avoid outages, security gaps, and surprise costs.

Key finding

Traditional REST/GraphQL/gRPC APIs are poorly matched to autonomous, iterative agent behavior.

MAO: multi‑agent LLM pipeline that generates and repairs BPMN process models

0.60

0.70

1

MAO automates BPMN drafting, reducing time and per‑model cost while producing models closer to reference designs than many human modelers on tested cases.

Key finding

MAO outperformed manual modelers on four FG‑C cases.

Numbers: MAO surpassed 89%, 61%, 52%, 75% of human models (datasets 1–4)