Survey: how LLMs learn to use external tools — workflow, benchmarks, and open problems

May 28, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

This survey compiles and critiques existing methods and benchmarks; it is a practical navigator but not original experimental work, so use it to find methods and datasets rather than as new evidence of a single technique's superiority.

Citations: 4

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Linking LLMs to real tools turns language models from static answerers into actionable assistants that fetch fresh facts, run domain tools, automate workflows, and show decision steps—improving trust and utility in products.

Who Should Care

Summary TLDR

This paper is a focused, up-to-date survey of 'tool learning' for large language models (LLMs). It explains why connecting LLMs to external tools matters (knowledge updates, domain expertise, automation, multi-modal input, interpretability, robustness), organizes methods around a four-stage workflow (task planning, tool selection, tool calling, response generation), catalogs 33+ benchmarks and toolkits, and lists major gaps: latency, benchmark realism, tool quality, safety, and unified frameworks. The survey reviews over 150 papers and points to practical toolkits such as LangChain and Auto-GPT.

Problem Statement

LLM capabilities keep growing, but models remain limited by fixed knowledge, weak specialist skills, an inability to act, and brittleness to input variation. Research on letting LLMs call external tools is moving fast but is fragmented. Practitioners lack a clear, unified map of methods, benchmarks, and open gaps for building reliable tool-augmented systems.

Main Contribution

Organizes tool learning around a four-stage workflow: task planning, tool selection, tool calling, response generation.

Summarizes why tools help LLMs across six concrete benefits: knowledge, expertise, automation, interaction, interpretability, robustness.
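The four-stage workflow above can be sketched as a plain function pipeline. This is a minimal illustration, not the survey's implementation: the `TOOLS` registry, the toy rules inside each stage, and every function name are hypothetical, and in a real system each stage would typically be driven by an LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: name -> callable. Both tools are toy stand-ins.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(float(x) for x in expr.split("+"))),
    "echo": lambda text: text,
}


def plan_task(query: str) -> List[str]:
    """Stage 1, task planning: split the query into sub-tasks (toy rule: ' then ')."""
    return [part.strip() for part in query.split(" then ") if part.strip()]


def select_tool(subtask: str) -> str:
    """Stage 2, tool selection: pick a tool name (toy rule: '+' means arithmetic)."""
    return "calculator" if "+" in subtask else "echo"


def call_tool(tool_name: str, argument: str) -> str:
    """Stage 3, tool calling: run the tool and report failures as text."""
    try:
        return TOOLS[tool_name](argument)
    except Exception as exc:  # a failed call should not crash the whole pipeline
        return f"error: {exc}"


def generate_response(results: List[str]) -> str:
    """Stage 4, response generation: merge tool outputs into one answer."""
    return "; ".join(results)


def run_pipeline(query: str) -> str:
    """Chain the four stages in order for every planned sub-task."""
    results = []
    for subtask in plan_task(query):
        tool_name = select_tool(subtask)
        results.append(call_tool(tool_name, subtask))
    return generate_response(results)
```

Keeping the stages as separate functions makes it possible to swap one of them (for example, replacing the rule in `select_tool` with a retriever) without touching the others.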

Key Findings

The survey reviewed more than 150 papers on tool learning.

Numbers: 150+ papers reviewed

Practical Use: Use this paper as a compact entry point to the field instead of scanning many separate papers.

Evidence Ref: Conclusion, §7

ToolBench2 is currently the largest public tool-learning dataset.

Numbers: 16,464 tools; 126,486 instances

Practical Use: If you need large-scale training or evaluation, start with ToolBench2 but expect quality issues and tool availability problems.

Evidence Ref: Table 1; §5.1

What To Try In 7 Days

Run a two-stage tool pipeline: add a fast retriever (BM25/embeddings) to pick 5 candidate APIs and let your LLM choose among them.

Use an existing toolkit (LangChain or BMTools) to prototype a single use case (e.g., pricing lookup) and measure latency and failure modes.

Evaluate with both synthetic queries and a small set of 100 real user queries to spot benchmark gaps and mismatches.
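The two-stage pipeline from the first item could look roughly like the sketch below. It uses a small hand-written BM25 scorer over tool descriptions so the example has no dependencies; the tool catalog, the `choose` callback (a stand-in for the LLM's final choice), and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical tool catalog: tool name -> natural-language description.
TOOLS: Dict[str, str] = {
    "get_price": "look up the current price of a product",
    "get_weather": "return the weather forecast for a city",
    "convert_currency": "convert an amount between two currencies",
}


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_rank(query: str, docs: Dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Stage 1: rank tool names by BM25 score of their description."""
    if not docs:
        return []
    tokenized = {name: tokenize(desc) for name, desc in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores: Dict[str, float] = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def select_tool(query: str, docs: Dict[str, str],
                choose: Callable[[str, List[str]], str], top_k: int = 5) -> str:
    """Stage 2: hand only the top_k candidates to the chooser (the LLM in practice)."""
    candidates = bm25_rank(query, docs)[:top_k]
    return choose(query, candidates)
```

In a prototype, `choose` would wrap a prompt that lists the candidate names and descriptions and asks the model to return one of them. Narrowing to a handful of candidates keeps that prompt short even when the catalog holds thousands of tools.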

Agent Features

Memory: Retrieval memory (RAG-style external knowledge)
Planning: One-step (plan once) planning; Iterative (feedback-driven) planning; Decision-tree / search-based planning
Tool Use: API function calls; Plugin-based tool invocation; Chained serial tool calls
Frameworks: LangChain; Auto-GPT; BabyAGI; BMTools
Is Agentic: Yes
Architectures: Retriever + LLM pipeline; Tool graph / directed-graph navigation; Multi-agent collaborative frameworks
Collaboration: Multi-agent cooperation (specialized execution agents)

Optimization Features

Token Efficiency: Context compression and output summarization (ReCOMP, Xu et al.)
System Optimization: Two-stage retrieval + LLM selection to manage large tool sets
Training Optimization: Finetuning with API-call datasets (Toolformer, ToolLLaMA); RLHF and execution-feedback tuning
Inference Optimization: Retriever narrowing to reduce context and latency; Intent-driven gating for tool selection

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Benchmarks referenced have public links (see Table 1; e.g., ToolBench2, API-Bank)

Risks & Boundaries

Limitations

Survey relies on many recent papers; the field evolves rapidly and new benchmarks may appear after publication.

Benchmarks cataloged often contain synthetic, LLM-generated queries that may not match live user behavior.

When Not To Use

When you require verified, safety-critical behavior without independent tool-output validation.

When latency must be sub-second and external API calls introduce delay.
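To check whether a given tool fits a latency budget, a small timing harness is usually enough. The sketch below is one way to do it, assuming the tool call can be wrapped in a zero-argument function; `measure_tool` and its return keys are hypothetical names, and the p95 value is a simple index-based estimate rather than an interpolated percentile.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_tool(call: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Time repeated tool calls and count how many raise an exception."""
    latencies_ms: List[float] = []
    failures = 0
    for _ in range(runs):
        start = time.perf_counter()
        try:
            call()
        except Exception:
            failures += 1  # failed calls still count toward latency
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    # Index-based p95: the value at the 95% position of the sorted samples.
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "failure_rate": failures / runs,
    }
```

Comparing `p95_ms` rather than the median against the budget is a conservative choice, since external APIs tend to have long-tailed response times.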

Failure Modes

Hallucinations driven by erroneous or malicious tool outputs.

Tool-call format errors and parameter mis-parsing causing failed calls.
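Format errors and mis-parsed parameters can often be caught before a call is executed. The sketch below validates a model-emitted call against a per-tool parameter schema. The JSON shape (`name` plus `arguments`), the `SCHEMAS` table, and the `get_price` tool are assumptions made for illustration, not a format defined by the survey.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical schema table: tool name -> {parameter name: expected Python type}.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "get_price": {"sku": str, "quantity": int},
}


def validate_tool_call(raw: str,
                       schemas: Dict[str, Dict[str, type]]) -> Tuple[bool, List[str]]:
    """Return (is_valid, errors) for a tool call emitted as a JSON string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return False, ["call must be a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in schemas:
        return False, [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, ["arguments must be a JSON object"]

    errors: List[str] = []
    expected = schemas[name]
    for param, expected_type in expected.items():
        if param not in args:
            errors.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected_type):
            errors.append(f"{param} should be {expected_type.__name__}")
    for param in args:
        if param not in expected:
            errors.append(f"unexpected parameter: {param}")
    return not errors, errors
```

The returned error strings can be fed back to the model as a retry hint, which is one common way to recover from a malformed call instead of failing outright.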

Core Entities

Models

GPT-4, GPT-J, LLaMA, LLaMA-7B, ToolLLaMA, Toolformer

Metrics

Recall@K, NDCG@K, COMP@K, BLEU, ROUGE-L, Exact Match, F1
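For tool retrieval, Recall@K and NDCG@K can be computed as in the sketch below. It assumes binary relevance (a tool is either in the gold set or not) and a ranked list without duplicates; individual benchmarks may define these metrics slightly differently, so treat this as the standard textbook form rather than any benchmark's exact scorer.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold tools that appear in the top-k retrieved tools."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, tool in enumerate(ranked[:k])
        if tool in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@K only asks whether the right tools made the cut, while NDCG@K also rewards placing them near the top, which matters when the downstream LLM is biased toward early candidates.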

Datasets

ToolBench2, API-Bank, APIBench, ToolBench1, ToolAlpaca, RestBench, ToolEyes, Seal-Tools

Benchmarks

API-Bank, APIBench, ToolBench1, ToolBench2, ToolEyes, MetaTool, RestBench, T-Eval, UltraTool

Context Entities

Models

Gorilla, ToolNet, ProTIP, CRAFT

Metrics

Pass rate (ChatGPT), Human evaluation, ToolEval automated scoring

Datasets

ToolQA, TaskBench, API-BLEND, StableToolBench, SciToolBench

Benchmarks

RoTBench, ToolSword, MLLM-Tool, SoAyBench, ToolLens