197 papers found

MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

0.70
0.60
0.60
130

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code that needs fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Key finding

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% (HumanEval) and 87.7% (MBPP)
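
For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@k score is commonly computed, assuming the standard unbiased estimator (1 - C(n-c, k) / C(n, k)) averaged over problems. The listing does not say which estimator or sample count the paper used, and the per-problem counts below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c of them pass, budget k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each item is (samples, passing samples)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results for three problems, 10 samples each (not from the paper).
demo = [(10, 10), (10, 5), (10, 0)]
print(f"pass@1 = {benchmark_pass_at_k(demo):.3f}")
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, so a Pass@1 figure can be read as the share of problems solved on the first attempt.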

ChatDev: multi-agent LLMs that chat to design, code, and test software

0.40
0.50
0.30
69

ChatDev makes prototyping software faster and more reliable by combining role-based LLM agents into a chained workflow that raises the chance code runs without heavy manual fixes.

Key finding

ChatDev generates more runnable software than baselines.

Numbers: Executability: ChatDev 0.88 vs GPT-Engineer 0.3583, MetaGPT 0.4145

Survey: five practical ways LLMs are used to plan agent behavior

0.40
0.60
0.60
29

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Key finding

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).

Use a pre-trained LLM (GPT-3.5) as a zero-shot search operator and distill it into a white-box linear operator for MOEA/D

0.40
0.60
0.50
21

You can prototype new evolutionary operators with natural-language prompts and then distill them into cheap, explainable operators — reducing expert design time and cutting API cost after distillation.

Key finding

MOEA/D-LLM (GPT-3.5) produces competitive hypervolume (HV) on five real engineering RE instances.

Numbers: RE21 HV: 0.7936 vs MOEA/D 0.781 (Table I)

Mobile-Agent: operate mobile apps from screenshots using visual perception

0.70
0.60
0.60
11

You can automate mobile UI flows without OS hooks or XML access. That lowers integration cost for cross-device automation, testing, and accessibility tools and works on devices where system metadata is unavailable.

Key finding

High task success on simple app instructions.

Numbers: Success (Instruction1) = 0.91

AgentLite: tiny open-source toolkit to rapidly prototype task-oriented and multi-agent LLM systems

0.60
0.40
0.50
10

AgentLite reduces code overhead for prototyping LLM agents so engineering teams can test agent ideas quickly without a heavy framework or large code refactor.

Key finding

AgentLite is small and focused: core codebase is under 1,000 lines.

Numbers: AgentLite core lines = 959; LangChain = 248,650 (Table 1)

An LLM agent that plans CRISPR experiments, designs guides and protocols, and was validated in a wet‑lab knockout

0.60
0.70
0.60
9

Automating CRISPR design reduces expert time, speeds prototyping, and lowers error risk in early‑stage research; it can cut planning cycles and standardize lab protocols for teams without CRISPR specialists.

Key finding

Domain‑augmented agent scored higher than general ChatGPT on expert design ratings.

Numbers: 12 experts; 1-5 rating scale; CRISPR‑GPT > ChatGPT 3.5/4 across Accuracy, Reasoning, Completeness, Conciseness

Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

0.40
0.80
0.60
7

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Key finding

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

VillagerAgent: use DAGs to coordinate LLM agents and a new VillagerBench in Minecraft

0.40
0.60
0.55
6

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Key finding

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)
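
The idea of coordinating agents through an explicit dependency graph can be sketched as follows. This is an illustrative sketch, not VillagerAgent's implementation: the subtask names, agent names, and round-robin assignment rule are all hypothetical, and only the standard-library `graphlib.TopologicalSorter` is used.

```python
from graphlib import TopologicalSorter

# Hypothetical cooking subtasks: each key depends on the tasks in its set.
deps = {
    "gather_wheat": set(),
    "gather_wood": set(),
    "craft_table": {"gather_wood"},
    "bake_bread": {"gather_wheat", "craft_table"},
}

def schedule(deps, agents):
    """Assign ready subtasks to agents round by round, respecting the DAG."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    rounds = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        # Round-robin assignment (an assumption made for this sketch).
        rounds.append([(agents[i % len(agents)], t) for i, t in enumerate(ready)])
        sorter.done(*ready)  # mark the whole round finished
    return rounds

plan = schedule(deps, ["alice", "bob"])
for number, assignments in enumerate(plan, start=1):
    print(number, assignments)
```

Because a task is only handed out once its prerequisites are marked done, no agent can be asked to bake bread before the table exists, which is the kind of coordination error an explicit DAG rules out.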

Use scene graphs + LLMs to split long robot goals into short sub-goals so classical planners solve them fast and reliably

0.60
0.45
0.55
6

DELTA turns large, slow planning problems into fast, reliable sub-problems so robots can plan long household workflows quickly and with higher success, cutting compute and time costs when paired with a strong LLM.

Key finding

DELTA with GPT-4o achieves highest success rates across evaluated domains.

Numbers: PC 98%, Dining 100%, Cleaning 80%, Office 74.67% (Table II)

Train LLMs to plan with abstract placeholders, then fill them with tools to reason faster and more accurately

0.70
0.60
0.60
5

CoA makes multi-step tool use both more accurate and faster by separating plan generation from tool calls; this reduces arithmetic bugs and shortens latency when pipelines must call external APIs.

Key finding

CoA improves QA accuracy on evaluated math benchmarks.

Numbers: GSM8K: roughly +2.9 to +6.8 pp absolute (varies by model); average gain of ~7.5% reported
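
Separating plan generation from tool calls can be illustrated with a small sketch. It assumes the model emits a plan containing bracketed steps such as `[20 + 35 = y1]`, and a calculator then fills in the placeholders in order. The bracket syntax, the `y1`/`y2` names, and the function `fill_placeholders` are assumptions made for illustration, not the paper's code.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
# One abstract step: [<operand> <op> <operand> = <placeholder>]
STEP = re.compile(r"\[(\w+(?:\.\d+)?) ([-+*/]) (\w+(?:\.\d+)?) = (\w+)\]")

def fill_placeholders(plan: str):
    """Run the calculator 'tool' on each step, then substitute the results."""
    values = {}

    def resolve(token: str) -> float:
        # An operand is either an earlier placeholder or a literal number.
        return values[token] if token in values else float(token)

    for left, op, right, name in STEP.findall(plan):
        values[name] = OPS[op](resolve(left), resolve(right))

    def substitute(match):
        word = match.group()
        return f"{values[word]:g}" if word in values else word

    return re.sub(r"\b\w+\b", substitute, plan), values

plan = "One crate holds [20 + 35 = y1] apples, so two crates hold [y1 * 2 = y2]."
filled, values = fill_placeholders(plan)
print(filled)
```

The language model only has to get the structure of the reasoning right; the arithmetic itself is done by the tool, which is where the reduction in arithmetic bugs would come from.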

TDAG: dynamically split complex tasks and auto-generate subagents to improve multi-step agent performance

0.60
0.60
0.50
5

TDAG reduces failure cascades and improves partial progress tracking, so agent-driven multi-step workflows are more reliable and auditable.

Key finding

TDAG achieves higher average score on ItineraryBench than baselines.

Numbers: TDAG avg 49.08 vs ReAct 43.02 (Table 2)

Multi-agent LLaMA 3 workflow matches expert prompts for detecting cognitive concerns in clinical notes

0.60
0.60
0.70
4

Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.

Key finding

Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.

Numbers: F1 = 0.91 (Table 3)

A modular blueprint for running reliable multi-agent workflows with planning, tool refinement, and episodic memory

0.60
0.50
0.40
4

Gives a reusable engineering blueprint to run reliable, auditable multi-agent automation across existing enterprise systems without retraining models.

Key finding

Narrow, persona-like agents perform more reliably than broad agents.

Train a search-based LLM agent to self-improve via iterative synthetic trajectories and distill it into much smaller models.

0.60
0.60
0.70
4

You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.

Key finding

Self-improvement raises small-model auto-eval accuracy substantially.

Numbers: PaLM 2-XS: 44.7±3.1% -> 65.9±2.6% (pilot to 2nd gen)

STRIDE: give an LLM a memory and small tools and it reliably follows algorithms for strategic decisions

0.40
0.65
0.45
4

STRIDE turns LLMs into reliable decision engines for algorithmic planning tasks by pairing language reasoning with small, auditable tools and memory; this lowers risk in automation that needs exact calculations or incentive-aware pricing.

Key finding

STRIDE finds optimal actions in tabular MDPs far more often than CoT baselines when given a single demonstration.

Numbers: Example: H=5,S=3,A=3 success rate STRIDE 0.98 vs 0.74 (best baseline)
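
As an example of the kind of small, exact tool an LLM could call for a tabular MDP, here is a sketch of finite-horizon value iteration (backward induction). It is a textbook algorithm rather than STRIDE's code, and the two-state MDP below is a hypothetical toy problem.

```python
def backward_induction(P, R, H):
    """Finite-horizon value iteration for a tabular MDP.

    P[s][a][s2] is a transition probability, R[s][a] an immediate reward,
    and H the horizon. Returns (V, policy), where policy[h][s] is the
    optimal action at step h in state s and V[s] the optimal H-step return.
    """
    S, A = len(R), len(R[0])
    V = [0.0] * S  # value after the final step
    policy = []
    for _ in range(H):  # sweep from the last decision step back to the first
        Q = [[R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        policy.append([max(range(A), key=lambda a: Q[s][a]) for s in range(S)])
        V = [max(Q[s]) for s in range(S)]
    policy.reverse()  # policy[0] now refers to the first decision step
    return V, policy

# Toy MDP (hypothetical): action 0 stays put, action 1 switches state.
P = [[[1, 0], [0, 1]],
     [[0, 1], [1, 0]]]
R = [[1, 0],   # staying in state 0 pays 1
     [3, 0]]   # staying in state 1 pays 3
V, policy = backward_induction(P, R, H=3)
print(V, policy)
```

In the toy problem the best plan from state 0 is to give up the immediate reward of 1, move to state 1, and collect 3 on each remaining step, which a greedy one-step choice would miss.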

Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

0.60
0.50
0.50
3

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router + summarizing state manager can boost success and reduce repeated failures with little engineering overhead.

Key finding

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT.

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%
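
The router plus state-manager pattern can be sketched in a few lines. In the paper both roles are played by an LLM. Here, purely as an assumption for illustration, a fixed rule stands in for the router and a small dataclass stands in for the summarizing state manager. The tool names and functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class State:
    """Running summary of which tools succeeded and which failed."""
    results: dict = field(default_factory=dict)
    failed: set = field(default_factory=set)

    def summary(self) -> str:
        done = ", ".join(sorted(self.results)) or "none"
        bad = ", ".join(sorted(self.failed)) or "none"
        return f"succeeded: {done}; failed: {bad}"

def route(candidates: list, state: State) -> Optional[str]:
    """Pick the next untried tool, skipping any recorded as failed."""
    for name in candidates:
        if name not in state.failed and name not in state.results:
            return name
    return None

def run(tools: dict, candidates: list, query: str, max_steps: int = 5) -> State:
    state = State()
    for _ in range(max_steps):
        name = route(candidates, state)
        if name is None:
            break  # nothing left to try
        try:
            state.results[name] = tools[name](query)
            break  # stop at the first successful call (a simplification)
        except Exception:
            state.failed.add(name)  # remembered so it is not retried
    return state

def flaky_search(query: str) -> str:  # hypothetical tool that always fails
    raise TimeoutError("upstream timeout")

def backup_search(query: str) -> str:  # hypothetical tool that succeeds
    return f"results for {query!r}"

tools = {"flaky_search": flaky_search, "backup_search": backup_search}
state = run(tools, ["flaky_search", "backup_search"], "weather in Oslo")
print(state.summary())
```

Keeping failures in the state is what prevents the repeated-failure loops mentioned above: the router never sees a failed tool as a fresh option.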

MAP: split planning into specialized LLM modules to get more reliable multi-step plans

0.50
0.60
0.25
3

If your product needs reliable multi-step decisions, splitting planning into specialized LLM modules reduces incorrect actions and improves transferability; you can also trade off accuracy and cost by using smaller models and caching.

Key finding

MAP solved the Valuepath graph task on evaluated problems.

Numbers: 100% solved (Valuepath, Table 4)

Agentic AI breaks the old rules of human-AI teams — shared awareness helps, but continuous governance is required

0.30
0.70
0.60
2

Agentic AI can change behavior and priorities after deployment; firms must monitor intermediate commitments, add decision checkpoints, and align incentives so automation doesn't drift from strategic goals.

Key finding

Agentic AI creates three structural uncertainties—action trajectories, generative outputs, and evolving objectives—that differ qualitatively from task-bound systems.

OctoTools: a training-free planner+executor agent that plugs in tools to boost multi-step reasoning

0.70
0.60
0.45
2

OctoTools turns general LLMs into practical, multi-step assistants by plugging in specialized tools and an explicit planner; this improves correctness on domain tasks and lets teams add domain tools without retraining models.

Key finding

OctoTools raises average accuracy from 49.2% to 58.5% across 16 benchmarks.

Numbers: Avg accuracy OctoTools 58.5% vs zero-shot 49.2% (Δ +9.3 pp)
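
Because the reported gain is a difference between two percentages, it is worth separating the absolute gain (percentage points) from the relative gain. A short worked calculation using only the two accuracy figures above (the helper name `gains` is illustrative):

```python
def gains(baseline_pct: float, new_pct: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - baseline_pct
    relative = absolute / baseline_pct * 100
    return round(absolute, 1), round(relative, 1)

abs_pp, rel_pct = gains(49.2, 58.5)
print(f"+{abs_pp} pp absolute, +{rel_pct}% relative")
```

The same 9.3-point improvement corresponds to a relative improvement of roughly 19% over the zero-shot baseline.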

LLM-powered multi-agent system automates WeChat Pay UAT and achieves 88.6% Pass@1

0.80
0.60
0.70
2

Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.

Key finding

Multi-agent system greatly improves pass rates versus a single-agent LLM.

Numbers: Pass@1: 88.55% vs 22.65% (Table 4)

Agent-E: hierarchical web agent with DOM denoising and change-observation — 73.2% on WebVoyager

0.45
0.60
0.50
2

A hierarchical agent with DOM denoising and action feedback raises generic web automation success to ~73% and gives actionable signals (self-aware failures) that support safe fallbacks and learning pipelines.

Key finding

Agent-E reached 73.2% task success on the WebVoyager benchmark.

Numbers: 73.2% overall success (WebVoyager)

LLM agent that perceives landmarks, stores memories, and plans to navigate cities without step-by-step instructions

0.60
0.70
0.50
2

PReP shows you can build autonomous navigation agents that operate without explicit step-by-step instructions and with far less RL data, enabling faster prototyping for navigation assistants, accessibility tools, and search-and-rescue prototypes.

Key finding

PReP substantially improves success rate over reactive and other LLM prompting baselines.

Numbers: Average SR ≈ 54% across four city test sets

Distill the planner, not the solver: small models can learn decomposition cheaply and generalize

0.60
0.60
0.75
2

You can offload planning to a cheap local model and keep expensive models for final solving, cutting inference cost while keeping accuracy on reasoning tasks.

Key finding

Distilling only the decomposer preserves or improves two-stage reasoning performance versus a single-stage approach on evaluated benchmarks.

Numbers: GSM8K EM: static two-stage ~65.13 vs single-stage ~20.32 (Table 1)