Survey: how LLMs learn to use external tools — workflow, benchmarks, and open problems

Overview

Decision Snapshot: Needs Validation

This survey compiles and critiques existing methods and benchmarks; it is a practical navigator but not original experimental work, so use it to find methods and datasets rather than as new evidence of a single technique's superiority.

Citations: 4

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen


Why It Matters For Business

Linking LLMs to real tools turns language models from static answerers into assistants that can act: they fetch fresh facts, run domain tools, automate workflows, and show their decision steps, which improves trust and utility in products.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

This paper is a focused, up-to-date survey of 'tool learning' for large language models (LLMs). It explains why connecting LLMs to external tools matters (knowledge updates, domain expertise, automation, multi-modal input, interpretability, robustness), organizes methods around a four-stage workflow (task planning, tool selection, tool calling, response generation), catalogs 33+ benchmarks and toolkits, and lists major gaps: latency, benchmark realism, tool quality, safety, and unified frameworks. The survey reviews over 150 papers and points to practical toolkits such as LangChain and Auto-GPT.

Problem Statement

LLM capabilities keep growing but remain limited by fixed knowledge, weak specialty skills, inability to act, and brittle handling of inputs. Research on letting LLMs call external tools is moving fast but is fragmented. Practitioners lack a clear, unified map of methods, benchmarks, and open gaps for building reliable tool-augmented systems.

Main Contribution

Organizes tool learning around a four-stage workflow: task planning, tool selection, tool calling, response generation.

Summarizes why tools help LLMs across six concrete benefits: knowledge, expertise, automation, interaction, interpretability, robustness.
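The four-stage workflow can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the survey's implementation: the `TOOLS` registry, the keyword-based selector, and all function names are hypothetical stand-ins for what would normally be LLM calls.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry (name -> callable); not taken from the survey.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda arg: str(sum(float(x) for x in arg.split("+"))),
    "echo": lambda arg: arg,
}


def plan_task(query: str) -> List[str]:
    """Stage 1, task planning: split the query into sub-tasks (here: on ';')."""
    return [part.strip() for part in query.split(";") if part.strip()]


def select_tool(subtask: str) -> str:
    """Stage 2, tool selection: a keyword rule standing in for an LLM or retriever."""
    return "calculator" if "+" in subtask else "echo"


def call_tool(tool_name: str, subtask: str) -> str:
    """Stage 3, tool calling: pass the sub-task to the chosen tool."""
    return TOOLS[tool_name](subtask)


def generate_response(results: List[str]) -> str:
    """Stage 4, response generation: merge tool outputs into one answer."""
    return " | ".join(results)


def run_pipeline(query: str) -> str:
    results = []
    for subtask in plan_task(query):
        tool = select_tool(subtask)
        results.append(call_tool(tool, subtask))
    return generate_response(results)


print(run_pipeline("1+2; hello"))  # prints "3.0 | hello"
```

In a real system each stage would typically be a prompt or a fine-tuned model; keeping the stages as separate functions makes it easier to measure where failures occur.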

Key Findings

The survey reviewed more than 150 papers on tool learning.

Numbers: 150+ papers reviewed

Practical Use: Use this paper as a compact entry point to the field instead of scanning many separate papers.

Evidence Ref: Conclusion, §7

ToolBench2 is currently the largest public tool-learning dataset.

Numbers: 16,464 tools; 126,486 instances

Practical Use: If you need large-scale training or evaluation, start with ToolBench2 but expect quality issues and tool availability problems.

Evidence Ref: Table 1; §5.1

What To Try In 7 Days

Run a two-stage tool pipeline: add a fast retriever (BM25/embeddings) to pick 5 candidate APIs and let your LLM choose among them.

Use an existing toolkit (LangChain or BMTools) to prototype a single use case (e.g., pricing lookup) and measure latency and failure modes.

Evaluate with both synthetic queries and a small set of about 100 real user queries to spot gaps between benchmark and live behavior.
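The two-stage pipeline from the first item above can be prototyped in a few lines. The sketch below is an assumption-laden illustration: it uses a simple TF-IDF-style token overlap score as a stand-in for BM25 or embeddings, the `API_CATALOG` entries are invented, and `llm_choose` is a hypothetical callback where your LLM call would go.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical API catalog (name -> description); replace with your own tools.
API_CATALOG: Dict[str, str] = {
    "price_lookup": "look up the current price of a product",
    "weather": "get the weather forecast for a city",
    "currency_convert": "convert an amount between two currencies",
    "stock_quote": "get the latest stock quote for a ticker",
    "inventory_check": "check product inventory in a warehouse",
    "translate": "translate text between languages",
    "calendar_add": "add an event to a calendar",
}


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def retrieve_candidates(query: str, catalog: Dict[str, str], k: int = 5) -> List[str]:
    """Stage 1: rank APIs by a TF-IDF-style overlap score and keep the top k."""
    docs = {name: Counter(tokenize(desc)) for name, desc in catalog.items()}
    doc_freq: Counter = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    query_tokens = tokenize(query)
    scores = {
        name: sum(
            counts[t] * math.log(1 + n_docs / doc_freq[t])
            for t in query_tokens
            if t in counts
        )
        for name, counts in docs.items()
    }
    # Sort by descending score; break ties by name for deterministic output.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


def choose_tool(
    query: str,
    candidates: List[str],
    llm_choose: Callable[[str, List[str]], str],
) -> str:
    """Stage 2: let the (hypothetical) LLM callback pick one of the candidates."""
    choice = llm_choose(query, candidates)
    if choice not in candidates:
        raise ValueError(f"LLM chose a tool outside the candidate set: {choice!r}")
    return choice


candidates = retrieve_candidates("current price of a product", API_CATALOG, k=5)
# A stub "LLM" that simply takes the top-ranked candidate.
print(choose_tool("current price of a product", candidates, lambda q, c: c[0]))
```

Rejecting choices outside the candidate set is a cheap guard against the model inventing a tool name. Note that low-scoring tools can still fill the remaining slots; a score threshold is a reasonable refinement.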

Agent Features

Memory

Retrieval memory (RAG-style external knowledge)

Planning

One-step (plan once) planning

Iterative (feedback-driven) planning

Decision-tree / search-based planning
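The difference between one-step and iterative planning can be shown with two small loops. This is a minimal sketch, not a method from any specific paper: `propose_next` and `execute` are hypothetical callbacks standing in for an LLM planner and a tool executor.

```python
from typing import Callable, List, Optional, Tuple


def one_step_plan(steps: List[str], execute: Callable[[str], str]) -> List[str]:
    """Plan once: the full step list is fixed up front and never revised."""
    return [execute(step) for step in steps]


def iterative_plan(
    goal: str,
    propose_next: Callable[[str, List[Tuple[str, str]]], Optional[str]],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> List[Tuple[str, str]]:
    """Feedback-driven: each new step is proposed after seeing prior observations."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        step = propose_next(goal, history)
        if step is None:  # the planner signals that the goal is met
            break
        observation = execute(step)
        history.append((step, observation))
    return history


# Toy planner: stop after two steps. A real planner would inspect the observations.
def toy_planner(goal: str, history: List[Tuple[str, str]]) -> Optional[str]:
    return None if len(history) >= 2 else f"step-{len(history) + 1}"


print(iterative_plan("demo goal", toy_planner, str.upper))
# prints [('step-1', 'STEP-1'), ('step-2', 'STEP-2')]
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a planner that never returns a stop signal would otherwise call tools indefinitely.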

Tool Use

API function calls

Plugin-based tool invocation

Chained serial tool calls

Frameworks

LangChain, Auto-GPT, BabyAGI, BMTools

Is Agentic

Yes

Architectures

Retriever + LLM pipeline

Tool graph / directed-graph navigation

Multi-agent collaborative frameworks

Collaboration

Multi-agent cooperation (specialized execution agents)

Optimization Features

Token Efficiency

Context compression and output summarization (ReCOMP, Xu et al.)

System Optimization

Two-stage retrieval + LLM selection to manage large tool sets

Training Optimization

Finetuning with API-call datasets (Toolformer, ToolLLaMA)

RLHF and execution-feedback tuning

Inference Optimization

Retriever narrowing to reduce context and latency

Intent-driven gating for tool selection

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/quchangle1/LLM-Tool-Survey

Data URLs

Benchmarks referenced have public links (see Table 1; e.g., ToolBench2, API-Bank)

Risks & Boundaries

Limitations

Survey relies on many recent papers; the field evolves rapidly and new benchmarks may appear after publication.

Benchmarks cataloged often contain synthetic, LLM-generated queries that may not match live user behavior.

When Not To Use

When you need verified, safety-critical behavior and cannot independently validate tool outputs.

When latency must be sub-second and external API calls would add too much delay.

Failure Modes

Hallucinations driven by erroneous or malicious tool outputs.

Tool-call format errors and parameter mis-parsing causing failed calls.
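Format errors and parameter mis-parsing can be caught before a call is executed. The sketch below validates a model-emitted tool call against a schema. Everything specific here is an assumption made for illustration: the `{"name": ..., "arguments": ...}` JSON shape, the `get_weather` tool, and the `TOOL_SCHEMAS` table are hypothetical and not prescribed by the survey.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical tool schemas: tool name -> {parameter name -> expected Python type}.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "get_weather": {"city": str, "days": int},
}


def parse_tool_call(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Parse and validate a JSON tool call; raise ValueError on any problem."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")

    schema = TOOL_SCHEMAS[name]
    missing = sorted(set(schema) - set(args))
    unexpected = sorted(set(args) - set(schema))
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")

    for param, expected_type in schema.items():
        # type() rather than isinstance() so JSON true/false is not accepted as int.
        if type(args[param]) is not expected_type:
            raise ValueError(
                f"parameter {param!r} should be {expected_type.__name__}, "
                f"got {type(args[param]).__name__}"
            )
    return name, args


print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}'))
# prints ('get_weather', {'city': 'Paris', 'days': 3})
```

A specific error message can be fed back to the model for a retry, which is one way to turn a failed call into a recoverable one.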

Core Entities

Models

GPT-4, GPT-J, LLaMA, LLaMA-7B, ToolLLaMA, Toolformer

Metrics

Recall@K, NDCG@K, COMP@K, BLEU, ROUGE-L, Exact Match, F1
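For tool retrieval, Recall@K and NDCG@K can be computed as below. This is a sketch using the common binary-relevance formulation; individual benchmarks may define these metrics differently, so check each benchmark's evaluation script before comparing numbers. The function names and toy data are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant tools that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain normalized by the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, tool in enumerate(ranked[:k])
        if tool in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked_tools = ["a", "b", "c", "d"]   # retriever output, best first
relevant_tools = {"a", "c"}           # ground-truth tools for the query

print(recall_at_k(ranked_tools, relevant_tools, k=3))  # prints 1.0
print(round(ndcg_at_k(ranked_tools, relevant_tools, k=3), 4))
```

Recall@K ignores order within the top K, while NDCG@K rewards placing the right tools earlier, so the two can disagree on which retriever is better.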

Datasets

ToolBench2, API-Bank, APIBench, ToolBench1, ToolAlpaca, RestBench, ToolEyes, Seal-Tools

Benchmarks

API-Bank, APIBench, ToolBench1, ToolBench2, ToolEyes, MetaTool, RestBench, T-Eval, UltraTool

Context Entities

Models

Gorilla, ToolNet, ProTIP, CRAFT

Metrics

Pass rate (ChatGPT), Human evaluation, ToolEval automated scoring

Datasets

ToolQA, TaskBench, API-BLEND, StableToolBench, SciToolBench

Benchmarks

RoTBench, ToolSword, MLLM-Tool, SoAyBench, ToolLens
