Agentic AI Papers — Parsed & Scored for Practitioners

MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

0.70

0.60

130

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code that needs fewer manual fixes. It trades higher token costs for less engineering review time and higher delivery quality.

Key finding

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% and 87.7% on evaluated benchmarks
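For readers unfamiliar with the metric, here is a minimal sketch of the unbiased pass@k estimator commonly used for code benchmarks. The summary above does not say how MetaGPT sampled its solutions, so treating its Pass@1 as this estimator is an assumption; the function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which pass the tests.

    Uses the standard unbiased form 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain fraction of passing samples.
single_sample_rate = pass_at_k(n=10, c=3, k=1)
```

With k = 1 the value is simply c / n, which is why Pass@1 can be read as the share of problems solved on the first attempt.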

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on their test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
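The debate idea can be sketched as a loop in which each copy answers, reads the other copies' answers, and revises. The stub agents below are hypothetical stand-ins for LLM calls, and the final majority vote is an assumption of this sketch rather than something stated in the summary.

```python
from collections import Counter
from typing import Callable, List, Sequence

# An agent maps (question, peer answers) to its own answer.
Agent = Callable[[str, Sequence[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    # Round 0: every agent answers independently, with no peer input.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the previous round's answers from all other agents.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Aggregate by majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]


def make_stub(initial: str) -> Agent:
    """Hypothetical agent: keeps its answer unless a strict peer majority disagrees."""
    def agent(question: str, peers: Sequence[str]) -> str:
        if not peers:
            return initial
        top, count = Counter(peers).most_common(1)[0]
        return top if count > len(peers) / 2 else initial
    return agent


final_answer = debate("What is 2 + 3 * 4?", [make_stub("12"), make_stub("14"), make_stub("14")])
```

In a real system each stub would be a model call whose prompt includes the peers' answers, so cost grows with both the number of agents and the number of rounds.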

Let LLMs translate problems and a classical planner find correct, often optimal, plans

0.70

0.60

0.70

84

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Key finding

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)
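The division of labor can be sketched as follows: a translation step turns the request into a formal problem, and a search procedure finds the plan. Everything here is a toy assumption. llm_translate is a hypothetical stand-in that returns a fixed problem, and the breadth-first search stands in for a real classical planner; it is not the planner used in the paper.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple

State = Tuple[str, str, bool]  # (robot room, ball room, robot is holding the ball)


def llm_translate(request: str) -> Dict[str, State]:
    """Hypothetical stand-in for the LLM step: natural language to a formal problem."""
    return {"init": ("A", "A", False), "goal": ("B", "B", False)}


def successors(state: State) -> Iterator[Tuple[str, State]]:
    robot, ball, holding = state
    other = "B" if robot == "A" else "A"
    # Moving carries the ball along only when the robot is holding it.
    yield f"move {robot}->{other}", (other, other if holding else ball, holding)
    if not holding and robot == ball:
        yield "pick", (robot, ball, True)
    if holding:
        yield "drop", (robot, ball, False)


def plan(init: State, goal: State) -> List[str]:
    """Breadth-first search, so the first plan found is a shortest one."""
    queue = deque([(init, [])])
    seen = {init}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    raise ValueError("no plan exists")


problem = llm_translate("Carry the ball from room A to room B.")
steps = plan(problem["init"], problem["goal"])
```

The point of the design is that correctness rests on the search procedure, so an error in the LLM step shows up as a wrong problem statement rather than as an invalid plan.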

OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

0.50

0.60

0.45

76

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Key finding

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero-shot) -> 0.5281 (few-shot)

ChatDev: multi-agent LLMs that chat to design, code, and test software

0.40

0.50

0.30

69

ChatDev makes prototyping software faster and more reliable by combining role-based LLM agents into a chained workflow that raises the chance code runs without heavy manual fixes.

Key finding

ChatDev generates more runnable software than baselines.

Numbers: Executability: ChatDev 0.88 vs GPT-Engineer 0.3583, MetaGPT 0.4145

ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

0.70

63

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning dramatically reduces manual engineering and makes open-source models practically competitive with closed systems.

Key finding

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls

PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

0.70

0.55

0.80

51

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Key finding

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

A modular graph framework that lets multiple LLM agents collaborate, create agents, and supervise each other

0.30

0.60

0.40

50

Modular LLM agents let teams split complex workflows, add verifiers to reduce costly errors, and plug in APIs safely — but they add orchestration costs and governance requirements.

Key finding

Agents can be modeled as tuples (L, R, S, C, H) to standardize behavior and permissions.

ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

0.30

0.40

0.20

39

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Key finding

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

0.70

0.50

0.70

32

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Key finding

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

LLMs fail at autonomous planning (~3% success) but their plans can be repaired and slightly help humans

1.00

0.60

0.40

31

If you plan to use LLMs for automated action sequencing or workflows, don't run them unsupervised — they rarely produce correct plans; use them as idea generators and pair with a certified planner or human review.

Key finding

LLMs rarely produce correct executable plans when used alone.

Numbers: GPT-3: 6/600 (1%); Instruct-GPT3: 41/600 (6.8%); BLOOM: 4/250 (1.6%); paper cites ≈3% average
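The per-model rates and the "about 3%" average can be checked with a few lines of arithmetic. Whether the paper averages the three rates or pools all attempts is not stated in this summary, so the sketch computes both; the variable names are illustrative.

```python
# (successful plans, attempted problems) as listed in the numbers above.
results = {
    "GPT-3": (6, 600),
    "Instruct-GPT3": (41, 600),
    "BLOOM": (4, 250),
}

# Success rate per model.
rates = {name: solved / total for name, (solved, total) in results.items()}

# Unweighted mean of the three rates: about 3.14%.
mean_rate = sum(rates.values()) / len(rates)

# Pooled rate over all attempts (51 of 1450): about 3.52%.
pooled_rate = sum(s for s, _ in results.values()) / sum(t for _, t in results.values())
```

Either way of averaging lands between 3% and 4%, consistent with the cited figure.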

Survey: five practical ways LLMs are used to plan agent behavior

0.40

0.60

29

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Key finding

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).
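The trade-off in these numbers is easier to see as relative changes. The sketch below assumes EX($) is a dollar expense for the evaluation run, which the summary does not define; the variable names are illustrative.

```python
# ALFWorld success rate (SR) and expense as listed above.
react = {"sr": 0.57, "cost": 152.18}
reflexion = {"sr": 0.71, "cost": 220.17}

# Relative gain in success rate: roughly +25%.
rel_sr_gain = (reflexion["sr"] - react["sr"]) / react["sr"]

# Relative increase in expense: roughly +45%.
rel_cost_increase = (reflexion["cost"] - react["cost"]) / react["cost"]

# Extra expense per additional percentage point of success rate.
cost_per_point = (reflexion["cost"] - react["cost"]) / ((reflexion["sr"] - react["sr"]) * 100)
```

On these figures the cost grows faster than the success rate, which is the balance the summary asks practitioners to weigh.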

CogEval: systematic tests show LLMs fail at cognitive maps and multi‑step planning

0.30

0.60

0.20

22

Do not assume LLMs can plan multi‑step tasks from text alone; failures scale with graph complexity and can cause incorrect or looping actions in planning applications.

Key finding

LLM, graph, domain, and condition strongly predict performance.

Numbers: LLM χ2=2357.87; graph χ2=3431.53; condition χ2=2080.04; domain χ2=458.74 (all p<.001)

Use a pre-trained LLM (GPT-3.5) as a zero-shot search operator and distill it into a white-box linear operator for MOEA/D

0.40

0.60

0.50

21

You can prototype new evolutionary operators with natural-language prompts and then distill them into cheap, explainable operators — reducing expert design time and cutting API cost after distillation.

Key finding

MOEA/D-LLM (GPT-3.5) produces competitive hypervolume (HV) on five real engineering RE instances.

Numbers: RE21 HV: 0.7936 vs MOEA/D 0.781 (Table I)

Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

0.60

0.50

19

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Key finding

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

AgentClinic: interactive, multimodal simulations that stress-test LLMs on real-style clinical decision making

0.30

0.70

0.40

18

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Key finding

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall below 10% of the static baseline (paper statement).

WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

0.60

0.70

0.60

16

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites and drastically raises task success in human-supervised runs.

Key finding

Modular WebAgent dramatically improves real-site success rates.

Numbers: Success: real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

Survey: how LLMs and LLM-based agents reshape software engineering workflows

0.60

0.45

0.60

15

LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing) and often raise real pass rates and reduce human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.

Key finding

Survey corpus and venue split — the review covers 139 papers and many are preprints.

Numbers: 139 papers; arXiv accounts for 40.3% of papers

ReWOO separates planning from fetching evidence to cut repeating prompt tokens and run smaller models

0.70

0.60

0.80

15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)
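The token figures above give a ratio of 9795.1 / 1986.2, or about 4.9. The mechanism behind the saving can be sketched as plan first, fetch evidence second, answer last: the planner writes all steps up front with placeholders for evidence it has not seen. The step format, the #E placeholders, and the toy tools below are assumptions of this sketch, and planner and solver are hypothetical stand-ins for LLM calls.

```python
import re
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (evidence id, tool name, tool input)


def planner(question: str) -> List[Step]:
    """Hypothetical stand-in for the single planning call; sees no tool output."""
    return [("#E1", "add", "2 3"), ("#E2", "square", "#E1")]


# Toy tools that operate on strings, as a tool-calling layer typically would.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda s: str(sum(int(x) for x in s.split())),
    "square": lambda s: str(int(s) ** 2),
}


def run_workers(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    evidence: Dict[str, str] = {}
    for eid, tool, arg in steps:
        # Replace placeholders with evidence gathered by earlier steps.
        filled = re.sub(r"#E\d+", lambda m: evidence.get(m.group(0), m.group(0)), arg)
        evidence[eid] = tools[tool](filled)
    return evidence


def solver(question: str, steps: List[Step], evidence: Dict[str, str]) -> str:
    """Hypothetical stand-in for the final call: here it returns the last evidence."""
    return evidence[steps[-1][0]]


question = "What is (2 + 3) squared?"
plan_steps = planner(question)
evidence = run_workers(plan_steps, TOOLS)
answer = solver(question, plan_steps, evidence)
```

Because the workers run without the model in the loop, the prompt is not resent after every observation, which is where an observation-dependent loop spends its repeated tokens.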

MINT: a compact benchmark that tests LLMs on multi-turn tool use and natural-language feedback

0.60

0.50

15

Interactive tool use and short user feedback materially change model success; measuring multi-turn behavior prevents wrong model choices and mispriced evaluation costs.

Key finding

Tool interaction gives consistent, per-turn success gains.

Numbers: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)

AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

0.60

0.70

13

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Key finding

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld Table 2).

TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

1.00

0.70

0.60

13

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort if paired with verification and robust data collection.

Key finding

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slasH‑

0.70

0.60

0.80

13

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Key finding

AGENTLESS resolves 96 of 300 SWE-bench Lite problems.

Numbers: 96/300 = 32.00%
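The three steps named in the title (localize, repair, validate) can be sketched as a fixed pipeline with no agent loop. This is a toy illustration under stated assumptions, not the paper's implementation: the repository is a dict of source strings, localization is a keyword match, and propose is a hypothetical stand-in for sampling candidate patches from an LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple

Repo = Dict[str, str]          # file name -> source text
Patch = Tuple[str, str, str]   # (file name, old text, new text)


def localize(repo: Repo, issue: str) -> List[str]:
    """Rank files by how many issue words appear in them (toy heuristic)."""
    words = set(issue.lower().split())
    scored = [(sum(w in src.lower() for w in words), name) for name, src in repo.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def apply_patch(repo: Repo, patch: Patch) -> Repo:
    name, old, new = patch
    patched = dict(repo)
    patched[name] = repo[name].replace(old, new)
    return patched


def validate(repo: Repo, test: Callable[[dict], bool]) -> bool:
    """Load every file into one namespace and run the test; any error counts as failure."""
    namespace: dict = {}
    try:
        for src in repo.values():
            exec(src, namespace)
        return test(namespace)
    except Exception:
        return False


def fix_issue(repo: Repo, issue: str, propose, test) -> Optional[Patch]:
    files = localize(repo, issue)
    for patch in propose(repo, files):
        if validate(apply_patch(repo, patch), test):
            return patch  # first candidate that passes validation
    return None


repo = {
    "mathutil.py": "def add(a, b):\n    return a - b\n",
    "strutil.py": "def shout(s):\n    return s.upper()\n",
}


def propose(repo: Repo, files: List[str]) -> List[Patch]:
    # Hypothetical candidates; a real pipeline would sample these from a model.
    return [(f, "a - b", new) for f in files for new in ("a * b", "a + b")]


fix = fix_issue(repo, "add returns wrong result", propose, lambda ns: ns["add"](2, 3) == 5)
```

Each stage runs once in a fixed order, so cost is bounded by the number of candidate patches rather than by an open-ended sequence of agent actions.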

AgentSims: a visual, multi-agent sandbox to build task-based LLM benchmarks quickly

0.50

0.60

13

AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). That reveals operational gaps not visible with static benchmarks and speeds prototyping for productized agents.

Key finding

Task-based evaluation reduces hackability, broadens tested abilities, and yields an objective pass rate.