Tool Selection Papers — Parsed & Scored for Practitioners

OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

0.50

0.60

0.45

76

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Key finding

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero) -> 0.5281 (few)

KG-Agent: a tool-augmented autonomous 7B LLM that reasons step-by-step over knowledge graphs

0.60

0.65

0.70

12

You can get KG-backed, multi-hop reasoning without expensive closed LLM APIs by fine-tuning a 7B open model on ~10K program-like instructions, cutting cost and improving cross-domain use of external KGs.

Key finding

Instruction-tuned KG-Agent (LLaMA2-7B) improves KGQA F1 over prior baselines on in-domain tests.

Numbers: F1 gains: WebQSP +1.7%, CWQ +7.5%, GrailQA +2.7% (Sec 5.2, Table 2)

OpenAgents — an open web platform hosting data, plugin, and web‑browsing language agents

0.70

0.60

11

OpenAgents gives product teams a ready web UI and backend components to demo and deploy agent features fast, cutting integration time for data tasks, API workflows, and browser automation.

Key finding

Plugins Agent integrates over 200 third-party plugins/APIs.

Numbers: 200+ plugins (text mentions "over 200 plugins")

An LLM agent that plans CRISPR experiments, designs guides and protocols, and was validated in a wet‑lab knockout

0.60

0.70

0.60

9

Automating CRISPR design reduces expert time, speeds prototyping, and lowers error risk in early‑stage research; it can cut planning cycles and standardize lab protocols for teams without CRISPR specialists.

Key finding

Domain‑augmented agent scored higher than general ChatGPT on expert design ratings.

Numbers: 12 experts; 1–5 rating scale; CRISPR‑GPT > ChatGPT 3.5/4 across Accuracy, Reasoning, Completeness, Conciseness

RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

0.60

0.50

0.60

6

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Key finding

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

Numbers: Overall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

0.60

0.70

0.60

5

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.

Key finding

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4

INJECAGENT: 1,054 realistic tests that measure how tool-enabled LLM agents can be hijacked by malicious content

0.60

0.70

5

Tool-enabled LLM agents can be hijacked by content they retrieve, causing unauthorized transactions or data leaks; firms must test agents with realistic IPI cases before deployment.

Key finding

INJECAGENT covers 1,054 test cases built from 17 user tools and 62 attacker instructions.

Numbers: 1,054 cases; 17 user tools; 62 attacker cases

Train a search-based LLM agent to self-improve via iterative synthetic trajectories and distill it into much smaller models.

0.60

0.70

4

You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.

Key finding

Self-improvement raises small-model auto-eval accuracy substantially.

Numbers: PaLM 2-XS: 44.7±3.1% -> 65.9±2.6% (pilot to 2nd gen)

Survey: how LLMs learn to use external tools — workflow, benchmarks, and open problems

0.60

0.50

0.60

4

Linking LLMs to real tools turns language models from static answerers into actionable assistants that fetch fresh facts, run domain tools, automate workflows, and show decision steps—improving trust and utility in products.

Key finding

The survey reviewed more than 150 papers on tool learning.

Numbers: 150+ papers reviewed

DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

0.60

0.50

3

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Key finding

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)

Split tool-use into planner, caller, summarizer so small LLMs handle APIs better

0.65

0.55

0.60

3

If you need reliable API/tool use but lack large expensive LLMs, splitting roles across small specialized models can raise correctness and cut hallucinations while keeping inference latency similar.

Key finding

Multi-LLM α-UMi improves planning and API-call metrics over a single-LLM baseline on ToolBench.

Numbers: Plan ACC 88.92 vs 81.92; Act. EM 58.94 vs 53.26 (7B, in-domain)

MetaLLM: route each query to the cheapest LLM likely to be correct, cutting cost up to 60% while keeping or improving accuracy

0.70

0.60

0.80

2

MetaLLM reduces API spend by routing easy queries to cheaper models and routes hard queries to stronger models, giving modest accuracy gains and up to ~60% cost reductions versus always using the priciest API.

Key finding

MetaLLM can outperform a mid-tier baseline (text-babbage-001) while keeping the same budget.

Numbers: SST-2: 84.06% vs 82.80% (text-babbage); cost 0.12 per 10k

ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

0.70

0.60

2

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data—reducing reliance on proprietary APIs and enabling in-house fine-tuned agents for automation tasks.

Key finding

ToolACE builds a very large synthetic API pool.

Numbers: 26,507 APIs across 390 domains

OctoTools: a training-free planner+executor agent that plugs in tools to boost multi-step reasoning

0.70

0.60

0.45

2

OctoTools turns general LLMs into practical, multi-step assistants by plugging in specialized tools and an explicit planner; this improves correctness on domain tasks and lets teams add domain tools without retraining models.

Key finding

OctoTools raises average accuracy from 49.2% to 58.5% across 16 benchmarks.

Numbers: Avg accuracy OctoTools 58.5% vs zero-shot 49.2% (∆ +9.3%)

Agent-E: hierarchical web agent with DOM denoising and change-observation — 73.2% on WebVoyager

0.45

0.60

0.50

2

A hierarchical agent with DOM denoising and action feedback raises generic web automation success to ~73% and gives actionable signals (self-aware failures) that support safe fallbacks and learning pipelines.

Key finding

Agent-E reached 73.2% task success on the WebVoyager benchmark.

Numbers: 73.2% overall success (WebVoyager)

An LLM conductor that chains music models and keeps a shared music state for iterative loop creation

0.50

0.60

0.40

2

Loop Copilot shows how an LLM can orchestrate specialized models to speed up prototyping and ideation in music; apply it to demo generation, rapid iteration, and studio assistants while planning for tighter DAW integration and finer controls.

Key finding

Participants found Loop Copilot usable

Numbers: SUS mean = 75.31 ± 15.32

A practical security and implementation guide for Plan‑then‑Execute LLM agents

0.80

0.60

0.70

1

P‑t‑E gives auditable, predictable automation and architectural defenses against prompt injection, lowering risk for regulated or high‑value workflows.

Key finding

Plan‑then‑Execute locks control flow before ingesting untrusted tool outputs, reducing risk of indirect prompt injection.

How LLM-based coding agents must earn developer trust to be useful

0.50

0.60

1

AI coding agents can cut developer time but only if they earn developer trust through verifiable outputs, provenance, and integrated review processes.

Key finding

Developer trust, not raw generation skill, is the main barrier to widespread adoption of AI software engineers.

ContextAgent: a proactive LLM agent that uses wearable sensors to reason and call tools automatically

0.55

0.70

0.50

1

Context-aware proactive assistants can act without prompts, reducing user friction and automating multi-step tasks by calling real tools; this lowers manual work and enables new hands-free services for wearables.

Key finding

ContextAgent raises proactive-decision accuracy and tool-calling correctness over baselines on the main benchmark.

Numbers: Acc-P +8.5%, F1 +7.0%, Acc-Args +6.0% (Llama3.1-8B base)

Fine-tune an LLM to parse causal questions, call causal tools, and explain results end-to-end

0.60

0.50

0.60

1

Fine-tuning an open LLM with targeted input-output examples turns it into a reliable causal assistant that extracts task details, runs analysis code, and explains results—reducing time to insight for analysts and lowering dependence on closed APIs.

Key finding

LLM4Causal-Mixed achieved much higher end-to-end accuracy (win rate) than GPT-4 on synthetic causal tasks

Numbers: Win rate avg 0.806 vs GPT4 avg ~0.12 (Table 2)

ToolEyes: a 7-scenario, 568-tool evaluation that measures five concrete tool-learning skills

0.60

0.70

1

ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.

Key finding

GPT-4 achieves the highest overall tool-learning score among tested models.

Numbers: s_overall = 70.31% (Table 2)

AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

0.60

0.70

0.50

1

AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.

Key finding

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

ShortcutsBench: a realistic Apple Shortcuts dataset to stress-test API-based agents

0.70

0.80

0.65

1

ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.

Key finding

ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.

Numbers: 88 apps; 1,414 APIs; 7,627 shortcuts (Table 2, Sec.3.1)

Mobile-Bench: a platform and dataset to test mobile LLM agents that use both UI actions and APIs with a CheckPoint process metric

0.50

0.60

1

Mobile-Bench helps teams test phone assistants and automation agents across realistic multi-app flows and shows that APIs speed tasks but require careful selection; invest in hybrid API+UI support and robust process checks.

Key finding

Mobile-Bench dataset: 832 cases across three difficulty tiers.

Numbers: SAST 332, SAMT 300, MAMT 200