Tool Augmentation Papers — Parsed & Scored for Practitioners

PaperQA: an agentic RAG system that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

0.70

0.55

0.80

51

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Key finding

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

AgentClinic: interactive, multimodal simulations that stress-test LLMs on realistic clinical decision-making

0.30

0.70

0.40

18

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Key finding

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall below 10% of the static baseline (paper statement).

WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

0.60

0.70

0.60

16

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites and drastically raises task success in human-supervised runs.

Key finding

Modular WebAgent dramatically improves real-site success rates.

Numbers: Success: real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

ReWOO separates planning from evidence fetching to cut repeated prompt tokens and run on smaller models

0.70

0.60

0.80

15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)
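
The decoupling described above can be sketched as follows. This is a minimal illustration, not ReWOO's code: it assumes the plan is written once up front and that later steps refer to earlier evidence through placeholders such as #E1. The tool names (Lookup, Upper), the plan format, and the toy data are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tools for illustration; real tools would call search APIs, calculators, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "Lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "Upper": lambda s: s.upper(),
}

# Each step: (evidence_id, tool_name, tool_input). The input may contain
# placeholders like #E1 that point at evidence from an earlier step.
Plan = List[Tuple[str, str, str]]


def run_plan(plan: Plan) -> Dict[str, str]:
    """Execute every step in order without calling the planner again."""
    evidence: Dict[str, str] = {}
    for evidence_id, tool_name, tool_input in plan:
        # Substitute earlier evidence into this step's input; unknown
        # placeholders are left untouched.
        resolved = re.sub(
            r"#E\d+",
            lambda m: evidence.get(m.group(0), m.group(0)),
            tool_input,
        )
        evidence[evidence_id] = TOOLS[tool_name](resolved)
    return evidence


plan: Plan = [
    ("#E1", "Lookup", "capital of France"),
    ("#E2", "Upper", "#E1"),
]
evidence = run_plan(plan)
print(evidence)
```

Because the planner prompt is not resent after each observation, prompt tokens do not accumulate step by step. On the reported HotpotQA figures, 9795.1 / 1986.2 is roughly 4.9, which is the "about 5×" reduction.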

MINT: a compact benchmark that tests LLMs on multi-turn tool use and natural-language feedback

0.60

0.50

15

Interactive tool use and short user feedback materially change model success; measuring multi-turn behavior prevents wrong model choices and mispriced evaluation costs.

Key finding

Tool interaction gives consistent, per-turn success gains.

Numbers: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)
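
A per-turn gain of this kind can be computed from success rates at increasing turn budgets. The sketch below assumes each task record stores the turn at which the task was first solved (or None if never solved). The record format and the sample values are made up for illustration and are not MINT data.

```python
from typing import List, Optional


def success_rate_by_budget(solved_at: List[Optional[int]], max_turns: int) -> List[float]:
    """SR@k for k = 1..max_turns: share of tasks solved within k turns."""
    total = len(solved_at)
    return [
        sum(1 for t in solved_at if t is not None and t <= k) / total
        for k in range(1, max_turns + 1)
    ]


def per_turn_gains(rates: List[float]) -> List[float]:
    """Absolute success-rate gain contributed by each additional turn."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


# Made-up records: turn at which each task was first solved, None if unsolved.
solved_at = [1, 2, None, 3, 1, None, 2, 5, None, 4]
rates = success_rate_by_budget(solved_at, max_turns=5)
gains = per_turn_gains(rates)
print(rates)
print(gains)
```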

API-Bank: a large, runnable benchmark and training set to measure and improve LLMs' API/tool use; includes Lynx, a fine-tuned model.

0.50

0.70

0.80

12

API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.

Key finding

API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.

Numbers: 73 APIs; 314 dialogues; 753 API calls
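
One simple way to score API calls against annotated dialogues is exact match on the API name and arguments. The sketch below assumes that simplification. The call schema and the example APIs are hypothetical, and this is not necessarily how API-Bank's own scorer works.

```python
from typing import Any, Dict, List

# Hypothetical call schema: {"name": <api name>, "args": {<param>: <value>}}
ApiCall = Dict[str, Any]


def call_matches(predicted: ApiCall, reference: ApiCall) -> bool:
    """A call counts as correct only if the API name and every argument match."""
    return (
        predicted.get("name") == reference.get("name")
        and predicted.get("args") == reference.get("args")
    )


def call_accuracy(predicted: List[ApiCall], reference: List[ApiCall]) -> float:
    """Fraction of reference calls reproduced exactly, compared position by position."""
    if not reference:
        return 0.0
    correct = sum(call_matches(p, r) for p, r in zip(predicted, reference))
    return correct / len(reference)


reference_calls = [
    {"name": "GetWeather", "args": {"city": "Oslo"}},
    {"name": "AddReminder", "args": {"text": "umbrella", "time": "08:00"}},
]
predicted_calls = [
    {"name": "GetWeather", "args": {"city": "Oslo"}},
    {"name": "AddReminder", "args": {"text": "umbrella", "time": "09:00"}},  # wrong time
]
accuracy = call_accuracy(predicted_calls, reference_calls)
print(accuracy)
```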

OpenAgents — an open web platform hosting data, plugin, and web‑browsing language agents

0.70

0.60

11

OpenAgents gives product teams a ready web UI and backend components to demo and deploy agent features fast, cutting integration time for data tasks, API workflows, and browser automation.

Key finding

Plugins Agent integrates over 200 third-party plugins/APIs.

Numbers: 200+ plugins (text mentions "over 200 plugins")

A practical review of how LLM-based autonomous agents are built, extended, and tested

0.40

0.50

0.60

9

LLM agents can automate complex multi-step digital tasks but are currently brittle; invest in tool integration, retrieval, and realistic evaluation before production to avoid failures and loss of user trust.

Key finding

Agents built for realistic web tasks still perform far below humans.

Numbers: GPT-4 agent task success 14.41% vs human 78.24%

Train LLM-based agents end-to-end with RL and let them ask humans for help

0.60

0.70

0.60

4

AGILE lets production agents learn when to call humans and when to act, improving accuracy while controlling human cost. That makes it practical for customer support, medical QA, and recommendation systems where mistakes are costly.

Key finding

AGILE (agile-vic13b-ppo) achieves a higher average total score on ProductQA than the GPT-4 agent.

Numbers: Total score (short answers) 0.784 vs agile-gpt4-prompt 0.718; +9.2% rel. (Table 4)
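
The relative figure follows from the two scores above. A short check, using a hypothetical helper name:

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Scores quoted in the Numbers line above.
agile_score = 0.784
gpt4_prompt_score = 0.718

absolute_gain = agile_score - gpt4_prompt_score
agile_gain = relative_gain(agile_score, gpt4_prompt_score)
print(f"absolute: {absolute_gain:.3f}, relative: {agile_gain:.1f}%")
```

The absolute gap is 0.066 points, and the relative gain rounds to 9.2%.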

xLAM: open-source models (1B–141B) plus a unified function-calling data pipeline that tops the Berkeley Function-Calling leaderboard

0.80

0.60

0.70

4

xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.

Key finding

Top overall accuracy on Berkeley Function-Calling Leaderboard v2.

Numbers: 87.31% overall accuracy (xLAM-8x22b-r, BFCL v2 cutoff 09/03/2024)

STRIDE: give an LLM a memory and small tools and it reliably follows algorithms for strategic decisions

0.40

0.65

0.45

4

STRIDE turns LLMs into reliable decision engines for algorithmic planning tasks by pairing language reasoning with small, auditable tools and memory; this lowers risk in automation that needs exact calculations or incentive-aware pricing.

Key finding

STRIDE finds optimal actions in tabular MDPs far more often than CoT baselines when given a single demonstration.

Numbers: Example: H=5,S=3,A=3 success rate STRIDE 0.98 vs 0.74 (best baseline)
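
For context, optimal actions in a finite-horizon tabular MDP can be computed exactly by backward induction. The sketch below shows that standard algorithm on a toy two-state MDP. It is not STRIDE's code; the transition and reward tables are invented for illustration.

```python
from typing import List, Tuple

Transitions = List[List[List[float]]]  # P[s][a][s2] = probability of s -> s2 under a
Rewards = List[List[float]]            # R[s][a] = immediate reward


def backward_induction(P: Transitions, R: Rewards, H: int) -> Tuple[List[List[float]], List[List[int]]]:
    """Return V[h][s] and the greedy policy[h][s] for steps h = 0..H-1."""
    S, A = len(R), len(R[0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is all zeros (no reward after the horizon)
    policy = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            # Q-value of each action: immediate reward plus expected future value.
            q = [
                R[s][a] + sum(P[s][a][s2] * V[h + 1][s2] for s2 in range(S))
                for a in range(A)
            ]
            best = max(range(A), key=lambda a: q[a])  # ties go to the lowest action index
            policy[h][s] = best
            V[h][s] = q[best]
    return V, policy


# Toy MDP: state 1 pays 1.0 per step if the agent stays there (action 0).
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]
V, policy = backward_induction(P, R, H=3)
print(V[0], policy[0])
```

With three steps left, starting in state 0 the best move is to switch to state 1 and then stay, for a total value of 2.0; starting in state 1 the value is 3.0.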

A public index of 67 deployed agentic AI systems that exposes capability documentation but sparse safety disclosure.

0.60

0.50

0.70

3

Agentic systems are moving into products; you need to verify safety practices before integrating them because public capability docs are common but safety disclosures are rare.

Key finding

The index catalogs 67 deployed agentic AI systems.

Numbers: n = 67

TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

0.60

0.70

0.60

3

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Key finding

TOOLMAKER implemented 80% of benchmark tools (12 of 15)

Numbers: 12/15 tools correct

Automated agent-driven medical knowledge graphs improve medical QA and rival much larger models

0.60

0.70

3

An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.

Key finding

AMG-RAG (8B) reaches F1 74.1% on MEDQA.

Numbers: F1 = 74.1% (MEDQA)

UrbanKGent: an LLM agent that builds city-scale knowledge graphs more cheaply and more accurately using geospatial tools

0.70

0.65

0.80

3

UrbanKGent lets teams build large, practical city knowledge graphs with small open models, cutting inference costs roughly 20× and lowering data needs, so you can deploy KG-driven city apps faster and cheaper.

Key finding

Fine-tuned UrbanKGent-13B outperforms GPT-4 on UrbanKGC accuracy on evaluated datasets.

Numbers: NYC: about +15% (RTE) and about +14% (KGC) accuracy vs GPT-4 on evaluated splits

Teach an LLM to call NCBI Web APIs and cut hallucinations on genomics QA

0.60

0.50

3

Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.

Key finding

GeneGPT outperforms retrieval and domain LLMs on evaluated genomics QA.

Numbers: Macro-average 0.83 (GeneGPT) vs 0.44 (New Bing) on GeneTuring
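
To make "calling NCBI Web APIs" concrete, the sketch below builds a request URL for the ESearch endpoint of NCBI E-utilities. It only constructs the URL and sends nothing. Which endpoints and parameters GeneGPT itself uses is not stated here, so the database, query term, and helper name are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base URL of NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an ESearch URL that asks for up to `retmax` matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Arbitrary example query: look up a gene symbol in the "gene" database.
url = esearch_url("gene", "BRCA1")
print(url)
```

An agent that emits such a URL, fetches the response, and answers from the returned records leaves a traceable chain from question to database entry.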

Turn CoT into a chat: let chat LLMs call tools step-by-step to improve math and multi-hop QA

0.60

0.45

0.50

3

If you deploy chat LLMs for complex reasoning, organizing the session as a multi-turn chat that can call calculators, equation solvers, or retrievers reduces end-to-end errors and integrates tools without heavy engineering.

Key finding

ChatCoT improves average MATH accuracy over the prior SOTA iterative method (PHP).

Numbers: MATH Avg: ChatCoT 39.4 vs PHP 36.5 (7.9% relative)

ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

0.70

0.60

2

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data—reducing reliance on proprietary APIs and enabling in-house fine-tuned agents for automation tasks.

Key finding

ToolACE builds a very large synthetic API pool.

Numbers: 26,507 APIs across 390 domains

GeneAgent: an LLM agent that queries biology databases to verify and improve gene‑set function explanations

0.70

0.60

2

GeneAgent reduces false functional claims by checking LLM outputs against curated biology databases, cutting manual validation time and producing more trustworthy gene‑set summaries for research pipelines.

Key finding

GeneAgent increases n‑gram and LCS name overlap over GPT-4 on evaluated datasets.

Numbers: ROUGE-1/ROUGE-L from 23.9%→31.0% (MsigDB); ROUGE-2 7.4%→15.5%

An agent that plans, calls visual tools, and uses a vision-based critic to boost multimodal VQA

0.60

0.50

2

MMCTAgent improves accuracy on hard visual QA tasks by combining planning, specialist vision tools, and an automated visual verifier — useful for analytics, media search, and QA over long videos, but it adds compute and tool dependencies.

Key finding

MMCTAgent yields higher accuracy than evaluated SOTA multimodal models on image benchmarks.

Numbers: MMVET: 74.24% (MMCT w/ critic) vs GPT-4V 60.2% (Table 1)

How LLM-based coding agents must earn developer trust to be useful

0.50

0.60

1

AI coding agents can cut developer time but only if they earn developer trust through verifiable outputs, provenance, and integrated review processes.

Key finding

Developer trust, not raw generation skill, is the main barrier to widespread adoption of AI software engineers.

AI agents boost capabilities but multiply inference cost, latency variance, and datacenter power needs.

0.30

0.60

0.90

1

AI agents can raise per-query compute and energy by tens to hundreds of times, driving much higher cloud costs and datacenter power needs; without cost-aware designs, agent features can become economically and environmentally unsustainable.

Key finding

Agentic systems issue many more LLM calls per request than single-turn models.

Numbers: Agents average 9.2× more LLM calls; LATS averages 71 LLM calls/request.
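
A back-of-the-envelope estimate shows how call counts translate into per-request cost. The sketch assumes every call uses the same number of tokens at a flat price. The token count and price are hypothetical placeholders; only the call multipliers (9.2 and 71) come from the figures above.

```python
def cost_per_request(llm_calls: float, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Estimated cost of one user request, assuming uniform calls and flat pricing."""
    return llm_calls * tokens_per_call / 1000.0 * usd_per_1k_tokens


TOKENS_PER_CALL = 1000     # hypothetical
USD_PER_1K_TOKENS = 0.01   # hypothetical

single_turn = cost_per_request(1, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
average_agent = cost_per_request(9.2, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
lats = cost_per_request(71, TOKENS_PER_CALL, USD_PER_1K_TOKENS)

print(f"single-turn ${single_turn:.3f}, agent ${average_agent:.3f}, LATS ${lats:.3f}")
```

This counts calls only. Agents that resend a growing context on each call would pay more per call than the flat figure assumed here.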

ContextAgent: a proactive LLM agent that uses wearable sensors to reason and call tools automatically

0.55

0.70

0.50

1

Context-aware proactive assistants can act without prompts, reducing user friction and automating multi-step tasks by calling real tools; this lowers manual work and enables new hands-free services for wearables.

Key finding

ContextAgent raises proactive-decision accuracy and tool-calling correctness over baselines on the main benchmark.

Numbers: Acc-P +8.5%, F1 +7.0%, Acc-Args +6.0% (Llama3.1-8B base)

A locally hosted LLM agent (RCAgent) that uses tools, snapshot keys, and trajectory-level self-consistency to improve cloud root-cause triage

0.70

0.60

1

RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.

Key finding

RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set.

Numbers: METEOR: RCAgent 15.15 vs ReAct 6.44 (+8.71)
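
The general idea behind self-consistency across trajectories is to run the agent several times and keep the answer most runs agree on. The sketch below uses a plain majority vote over normalized labels, which is a simplification: RCAgent produces free-text root causes and its aggregation method may differ. The function names and canned answers are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(run_trajectory: Callable[[int], str], n_samples: int) -> str:
    """Run the agent n_samples times and return the most frequent normalized answer."""
    answers: List[str] = [
        run_trajectory(seed).strip().lower() for seed in range(n_samples)
    ]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer


# Stand-in for full agent runs: canned final answers, one per sampled trajectory.
canned = ["Disk full", "disk full ", "Network timeout", "disk full", "OOM"]
root_cause = self_consistent_answer(lambda seed: canned[seed], n_samples=len(canned))
print(root_cause)
```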