Split tool-use into planner, caller, summarizer so small LLMs handle APIs better

January 14, 2024 (8 min read)

Overview

Decision Snapshot: Ready For Pilot

The approach is a pragmatic engineering trade: better correctness and lower hallucination for small LLMs at the cost of more storage and some extra training; inference latency stays similar.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 55%

Authors

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang

Links

Abstract / PDF / Code

Why It Matters For Business

If you need reliable API/tool use but cannot run a large, expensive LLM, splitting roles across small specialized models can raise correctness and cut hallucinations while keeping inference latency similar.

Who Should Care

Summary TLDR

The paper shows small open-source LLMs struggle when one model must plan, generate API calls, and summarize results. The authors propose α-UMi: three small LLMs (planner, caller, summarizer) trained via a two-stage global-to-local progressive fine-tuning (GLPFT). On API-tool benchmarks (ToolBench, ToolAlpaca) α-UMi (7B) improves planning accuracy and API-call correctness vs single-LLM baselines and matches or beats larger single models while adding storage and some training cost. The method is practical when you can trade modest extra training/storage for more reliable tool use.

Problem Statement

Small open-source LLMs fail to simultaneously learn planning, safe API invocation, and answer summarization. Training one small model for all roles lowers quality and increases hallucinations. The paper asks: can we split roles across multiple small models and train them so tool use improves without needing a single large model?

Main Contribution

α-UMi: a multi-LLM agent that splits tool learning into planner, caller, summarizer.

GLPFT: a two-stage fine-tuning recipe (global fine-tune then local role-specific fine-tune).
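The two-stage recipe can be pictured as a data-preparation step. The following is a minimal sketch, assuming each training trajectory is a list of steps with hypothetical fields `rationale`, `action`, and `observation`, plus a `final_answer`. These field names and the prompt layout are illustrative assumptions, not the authors' data format. Stage 1 trains one shared backbone on the whole trajectory; stage 2 gives each role only its own targets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    rationale: str    # planner output: reasoning plus next-step decision
    action: str       # caller output: API name and arguments
    observation: str  # tool response (context only, never a training target)


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""


def global_example(traj: Trajectory) -> Dict[str, str]:
    """Stage 1 (global): one example whose target covers every role's output."""
    parts = []
    for s in traj.steps:
        parts += [f"Rationale: {s.rationale}", f"Action: {s.action}"]
    parts.append(f"Answer: {traj.final_answer}")
    return {"prompt": traj.instruction, "target": "\n".join(parts)}


def local_examples(traj: Trajectory) -> Dict[str, List[Dict[str, str]]]:
    """Stage 2 (local): split the same trajectory into role-specific pairs."""
    out = {"planner": [], "caller": [], "summarizer": []}
    history = traj.instruction
    for s in traj.steps:
        out["planner"].append({"prompt": history, "target": s.rationale})
        history += f"\nRationale: {s.rationale}"
        out["caller"].append({"prompt": history, "target": s.action})
        history += f"\nAction: {s.action}\nObservation: {s.observation}"
    out["summarizer"].append({"prompt": history, "target": traj.final_answer})
    return out
```

In this sketch the three stage-2 datasets would each fine-tune a separate copy of the stage-1 backbone, which is where the extra storage cost comes from.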

Key Findings

Multi-LLM α-UMi improves planning and API-call metrics over a single-LLM baseline on ToolBench.

Numbers: Plan ACC 88.92 vs 81.92; Act. EM 58.94 vs 53.26 (7B, in-domain)

Practical Use: Use role decomposition to raise step-level planning accuracy and API-action correctness when model size is limited.

Evidence Ref: Table 1 (ToolBench, 7B, α-UMi w/ reuse vs Single-LLM)

α-UMi cuts API-name hallucinations roughly fourfold.

Numbers: Hallu. 0.57 vs 2.32 (ToolBench in-domain, 7B)

Practical Use: Splitting out the caller role cuts hallucinated API calls, lowering the risk of broken calls and user confusion.

Evidence Ref: Table 1 (Hallu. column)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Plan ACC (in-domain, 7B) | 88.92 (α-UMi w/ reuse) | 81.92 (Single-LLM) | +7.00 | ToolBench (in-domain) | Table 1: Plan ACC 88.92 vs 81.92 | Table 1
Action EM (in-domain, 7B) | 58.94 (α-UMi w/ reuse) | 53.26 (Single-LLM) | +5.68 | ToolBench (in-domain) | Table 1: Act. EM 58.94 vs 53.26 | Table 1
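The deltas reported above can be re-derived from the raw scores. A small sketch that recomputes the absolute change in points and adds the relative change; the hallucination figures are the ones quoted under Key Findings, and the dictionary keys are illustrative names only.

```python
# Scores quoted in this summary (ToolBench in-domain, 7B).
results = {
    "Plan ACC": {"alpha_umi": 88.92, "single_llm": 81.92},
    "Action EM": {"alpha_umi": 58.94, "single_llm": 53.26},
    "Hallu.": {"alpha_umi": 0.57, "single_llm": 2.32},  # lower is better
}


def deltas(value: float, baseline: float):
    """Return (absolute delta in points, relative change in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for name, r in results.items():
        abs_d, rel_d = deltas(r["alpha_umi"], r["single_llm"])
        print(f"{name}: {abs_d:+.2f} points ({rel_d:+.1f}% relative)")
```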

What To Try In 7 Days

Run a small pilot: fine-tune one shared 7B backbone then clone it into planner/caller/summarizer following GLPFT.

Measure step-level metrics (Plan ACC, Action EM, Hallu.) on a held-out set of your APIs.

Compare storage and training cost vs swapping to one larger model to verify cost-effectiveness for your workload.
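For the measurement step, the three step-level metrics can be computed from logged predictions. This is a minimal sketch under assumed definitions: Plan ACC as the share of steps whose predicted next-step decision matches the reference, Action EM as the share of reference API calls whose predicted API name matches exactly, and Hallu. as the share of predicted API calls naming an API that is not in the available set. The record keys (`pred_decision`, `ref_decision`, `pred_api`, `ref_api`) are hypothetical, and the benchmark's own scoring scripts may differ in detail.

```python
from typing import Dict, Iterable, List, Optional


def step_metrics(steps: List[Dict[str, Optional[str]]],
                 available_apis: Iterable[str]) -> Dict[str, float]:
    """Compute Plan ACC, Action EM, and hallucination rate, all in percent.

    pred_api / ref_api are None when the step makes no API call.
    """
    api_set = set(available_apis)
    plan_hits = sum(s["pred_decision"] == s["ref_decision"] for s in steps)

    ref_calls = [s for s in steps if s["ref_api"] is not None]
    em_hits = sum(s["pred_api"] == s["ref_api"] for s in ref_calls)

    pred_calls = [s for s in steps if s["pred_api"] is not None]
    hallucinated = sum(s["pred_api"] not in api_set for s in pred_calls)

    def pct(num: int, den: int) -> float:
        return 100.0 * num / den if den else 0.0

    return {
        "plan_acc": pct(plan_hits, len(steps)),
        "action_em": pct(em_hits, len(ref_calls)),
        "hallu": pct(hallucinated, len(pred_calls)),
    }
```

Running the same function on the single-model baseline and on the three-model setup gives a like-for-like comparison on your own APIs.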

Agent Features

Memory
short execution trajectory passed between steps (τ_{t-1})
Planning
dedicated planner LLM; rationale generation; next-step decision (caller/conclusion/give up)
Tool Use
dedicated caller LLM for API calls; explicit API-name/argument formatting prompt
Frameworks
α-UMi multi-LLM agent; GLPFT (global-to-local progressive fine-tuning)
Is Agentic

Yes

Architectures
LLaMA-2-chat-7B, LLaMA-2-chat-13B
Collaboration
sequential planner -> caller -> (tools) -> planner loop; final summarizer composes user answer
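The collaboration pattern listed above can be sketched as a simple control loop. This is an illustrative sketch, not the authors' implementation: `planner`, `caller`, and `summarizer` stand in for the three fine-tuned models and are passed as plain callables, and the dictionary shapes they return, the trajectory text format, and the `max_steps` limit are all assumptions.

```python
from typing import Any, Callable, Dict


def run_agent(instruction: str,
              planner: Callable[[str], Dict[str, str]],
              caller: Callable[[str], Dict[str, Any]],
              summarizer: Callable[[str], str],
              tools: Dict[str, Callable[..., str]],
              max_steps: int = 8) -> str:
    """Loop planner -> caller -> tool until the planner concludes or gives up."""
    trajectory = f"User: {instruction}"
    for _ in range(max_steps):
        # Planner returns a rationale and a decision:
        # "caller", "conclusion", or "give up".
        plan = planner(trajectory)
        trajectory += f"\nRationale: {plan['rationale']}"
        if plan["next"] == "conclusion":
            return summarizer(trajectory)
        if plan["next"] == "give up":
            return "Unable to complete the request with the available tools."

        # Caller returns an API name and keyword arguments.
        call = caller(trajectory)
        tool = tools.get(call["api"])
        if tool is None:
            # A hallucinated API name is fed back so the planner can recover.
            observation = f"Error: unknown API {call['api']!r}"
        else:
            observation = tool(**call["args"])
        trajectory += (f"\nAction: {call['api']}({call['args']})"
                       f"\nObservation: {observation}")
    return "Unable to complete the request within the step limit."
```

Only one model generates at each step, which is consistent with the claim that inference latency stays close to the single-model case.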

Optimization Features

Infra Optimization
use DeepSpeed ZeRO Stage 3 for fine-tuning
System Optimization
role-specific prompts to narrow model outputs
Training Optimization
two-stage GLPFT: global fine-tune shared backbone, then local role-specific fine-tune
Inference Optimization
no extra generation at inference; sub-tasks distributed to keep latency similar
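For the infrastructure item, a DeepSpeed configuration that enables ZeRO Stage 3 might look like the sketch below, written as a Python dict that is dumped to JSON. Only the stage number comes from the summary; the precision, batch size, and accumulation values are placeholder assumptions to tune for your hardware.

```python
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,                       # the only setting taken from the text
        "overlap_comm": True,             # assumption: overlap communication with compute
        "contiguous_gradients": True,     # assumption: reduce memory fragmentation
    },
    "bf16": {"enabled": True},            # assumption: bf16 mixed precision
    "train_micro_batch_size_per_gpu": 4,  # assumption: placeholder value
    "gradient_accumulation_steps": 8,     # assumption: placeholder value
}


def write_config(path: str = "ds_zero3.json") -> str:
    """Write the config to a JSON file that a DeepSpeed launcher can read."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```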

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires ~3× model storage if all role models use the same backbone size.

Training cost increases ~1.3×–1.5× (longer fine-tuning) compared to single-LLM.
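These two limits translate into simple budget arithmetic. A rough sketch, assuming 16-bit weights (2 bytes per parameter) and counting weights only, with no optimizer state, activations, or KV cache; the function names are illustrative.

```python
def storage_gb(params_billion: float, copies: int, bytes_per_param: int = 2) -> float:
    """Weight storage in GB (10^9 bytes), assuming 16-bit weights by default."""
    return params_billion * bytes_per_param * copies


def training_cost_range(single_llm_cost: float, low: float = 1.3, high: float = 1.5):
    """Scale a single-LLM training cost by the ~1.3x to 1.5x range quoted above."""
    return single_llm_cost * low, single_llm_cost * high


if __name__ == "__main__":
    print("one 7B model:    ", storage_gb(7, copies=1), "GB")
    print("three 7B models: ", storage_gb(7, copies=3), "GB")
    print("one 13B model:   ", storage_gb(13, copies=1), "GB")
    print("training cost for a 100-unit baseline:", training_cost_range(100))
```

Under these assumptions three 7B copies need more weight storage than one 13B model, so the comparison against a single larger model is worth running before committing.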

When Not To Use

When storage is extremely constrained (cannot host multiple model copies).

When you can afford a single much larger LLM that already meets your tool-use needs.

Failure Modes

Caller still outputs malformed requests if prompts or examples are poor.

Broken or changing APIs can force recovery loops despite planner fallback.

Core Entities

Models

LLaMA-2-chat-7B, LLaMA-2-chat-13B, α-UMi (planner/caller/summarizer variants)

Metrics

Plan ACC, Action EM, Argument F1, Rouge-L, Hallucination rate, Proc. correctness (ToolAlpaca), Ans. correctness (ToolAlpaca), Pass rate / Win rate (real-time)

Datasets

ToolBench, ToolAlpaca, MATH, GSM8K

Benchmarks

ToolBench, ToolAlpaca, MATH, GSM8K

Context Entities

Models

ChatGPT (GPT-3.5), GPT-4, Claude-2, ToolLLaMA