Split tool-use into planner, caller, summarizer so small LLMs handle APIs better

January 14, 2024 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

3

Authors

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang

Links

Abstract / PDF

Why It Matters For Business

If you need reliable API/tool use but cannot afford large, expensive LLMs, splitting roles across small specialized models can raise correctness and cut hallucinations while keeping inference latency similar.

Summary TLDR

The paper shows small open-source LLMs struggle when one model must plan, generate API calls, and summarize results. The authors propose α-UMi: three small LLMs (planner, caller, summarizer) trained via a two-stage global-to-local progressive fine-tuning (GLPFT). On API-tool benchmarks (ToolBench, ToolAlpaca) α-UMi (7B) improves planning accuracy and API-call correctness vs single-LLM baselines and matches or beats larger single models while adding storage and some training cost. The method is practical when you can trade modest extra training/storage for more reliable tool use.

Problem Statement

Small open-source LLMs fail to simultaneously learn planning, safe API invocation, and answer summarization. Training one small model for all roles lowers quality and increases hallucinations. The paper asks: can we split roles across multiple small models and train them so tool use improves without needing a single large model?

Main Contribution

α-UMi: a multi-LLM agent that splits tool learning into planner, caller, summarizer.

GLPFT: a two-stage fine-tuning recipe (global fine-tune then local role-specific fine-tune).

Empirical study on ToolBench/ToolAlpaca showing α-UMi improves tool-use metrics and data-scaling, with analysis of costs and training dynamics.

Key Findings

Multi-LLM α-UMi improves planning and API-call metrics over a single-LLM baseline on ToolBench.

Numbers: Plan ACC 88.92 vs 81.92; Act. EM 58.94 vs 53.26 (7B, in-domain)

α-UMi reduces API-name hallucinations notably.

Numbers: Hallu. 0.57 vs 2.32 (ToolBench in-domain, 7B)

A 7B α-UMi can outperform a single 13B LLM on tool-use benchmarks.

Numbers: Plan ACC 88.92 (α-UMi 7B) vs 81.01 (Single-LLM 13B)

GLPFT (global fine-tune, then local fine-tune) is necessary; a naive one-stage multi-LLM setup performs worse.

Numbers: Multi-LLM one-stage Act. EM 45.11 vs α-UMi 58.94 (7B)

The multi-LLM setup increases storage and training cost but leaves inference latency nearly unchanged.

Numbers: Storage ×3; train time 63.34h vs 41.54h (α-UMi vs Single-LLM, 7B); inference time ~6.27s vs 6.41s

Results

Plan ACC (in-domain, 7B)

Value: 88.92 (α-UMi w/ reuse)

Baseline: 81.92 (Single-LLM)

Action EM (in-domain, 7B)

Value: 58.94 (α-UMi w/ reuse)

Baseline: 53.26 (Single-LLM)

Hallucination (in-domain, 7B)

Value: 0.57 (α-UMi w/ reuse)

Baseline: 2.32 (Single-LLM)

ToolAlpaca Process Correctness (7B)

Value: 41 (α-UMi)

Baseline: 11 (Single-LLM)

Real-time pass rate (avg)

Value: 70.9 (α-UMi 7B average pass)

Baseline: 64.8 (ChatGPT avg pass reported)
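
To make the size of these gaps easier to compare, the short Python sketch below recomputes absolute and relative differences from the figures reported on this page. The function names and the dictionary layout are illustrative choices, not anything taken from the paper, and the numbers are copied as-is rather than re-measured.

```python
# Values copied from the Results and Key Findings sections of this page.
# Each entry is (value, baseline). Note that the real-time pass rate
# baseline is ChatGPT, not the single-LLM 7B model.
results = {
    "Plan ACC": (88.92, 81.92),
    "Action EM": (58.94, 53.26),
    "Hallucination": (0.57, 2.32),
    "ToolAlpaca Proc. correctness": (41.0, 11.0),
    "Real-time pass rate": (70.9, 64.8),
}


def absolute_gain(value, baseline):
    """Difference in points (value minus baseline)."""
    return round(value - baseline, 2)


def relative_change(value, baseline):
    """Signed change relative to the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 1)


gains = {name: absolute_gain(v, b) for name, (v, b) in results.items()}
hallucination_change = relative_change(*results["Hallucination"])
train_time_ratio = round(63.34 / 41.54, 2)  # reported training hours, 7B

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f} points")
print(f"Hallucination relative change: {hallucination_change}%")
print(f"Training time ratio: {train_time_ratio}x")
```

For Hallucination a negative change is the desired direction, since lower is better for that metric.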

Who Should Care

What To Try In 7 Days

Run a small pilot: fine-tune one shared 7B backbone, then clone it into planner, caller, and summarizer copies and fine-tune each on its own role, following GLPFT.

Measure step-level metrics (Plan ACC, Action EM, Hallu.) on a held-out set of your APIs.

Compare storage and training cost against switching to a single larger model to verify cost-effectiveness for your workload.
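
For the measurement step, a minimal Python sketch of step-level scoring might look like the following. The metric definitions here are simplifying assumptions for a pilot (Plan ACC as agreement on the next-step decision, Action EM as exact match on the API name at reference call steps, hallucination rate as the share of predicted API names missing from the available tool list); the benchmark's official definitions may differ. `Step` and the example API names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    decision: str            # "caller", "conclusion", or "give up"
    api_name: Optional[str]  # API picked at this step, None if no call


def plan_acc(pred, gold):
    """Share of steps where the predicted decision matches the reference."""
    hits = sum(p.decision == g.decision for p, g in zip(pred, gold))
    return hits / len(gold)


def action_em(pred, gold):
    """Exact match on API name, scored only where the reference makes a call."""
    pairs = [(p, g) for p, g in zip(pred, gold) if g.decision == "caller"]
    hits = sum(p.api_name == g.api_name for p, g in pairs)
    return hits / len(pairs) if pairs else 0.0


def hallucination_rate(pred, available_apis):
    """Share of predicted API names that do not exist in the tool list."""
    calls = [p.api_name for p in pred if p.api_name is not None]
    bad = sum(name not in available_apis for name in calls)
    return bad / len(calls) if calls else 0.0


# Hypothetical held-out steps: reference first, then model predictions.
gold = [Step("caller", "get_weather"), Step("caller", "get_forecast"),
        Step("conclusion", None)]
pred = [Step("caller", "get_weather"), Step("caller", "get_wether_v2"),
        Step("caller", "get_forecast")]
available = {"get_weather", "get_forecast"}

print(plan_acc(pred, gold), action_em(pred, gold),
      hallucination_rate(pred, available))
```

Scoring each step against the reference context, rather than only the final answer, shows which role is failing when end-to-end results are poor.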

Agent Features

Memory

  • short execution trajectory passed between steps (τ_{t-1})

Planning

  • dedicated planner LLM
  • rationale generation
  • next-step decision (caller / conclusion / give up)

Tool Use

  • dedicated caller LLM for API calls
  • explicit API-name/argument formatting prompt

Frameworks

  • α-UMi multi-LLM agent
  • GLPFT (global-to-local progressive fine-tuning)

Is Agentic

true

Architectures

  • LLaMA-2-chat-7B
  • LLaMA-2-chat-13B

Collaboration

  • sequential planner->caller->(tools)->planner loop
  • final summarizer composes user answer
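
The sequential loop described above can be sketched in Python roughly as follows. This is a toy orchestration under stated assumptions, not the authors' implementation: the three models are replaced by stub functions, and `run_agent`, the trajectory record layout, the `max_steps` budget, and the `get_weather` tool are all hypothetical names introduced for illustration.

```python
def run_agent(instruction, planner, caller, summarizer, tools, max_steps=5):
    """Run the planner -> caller -> tool loop, then summarize."""
    trajectory = []  # short execution history passed between steps
    for _ in range(max_steps):
        rationale, decision = planner(instruction, trajectory)
        if decision == "conclusion":
            return summarizer(instruction, trajectory)
        if decision == "give up":
            return "Unable to complete the request."
        # Any other decision is treated as "caller": request a concrete call.
        api_name, arguments = caller(instruction, trajectory, rationale)
        tool = tools.get(api_name)
        if tool is None:
            observation = f"error: unknown API '{api_name}'"
        else:
            observation = tool(**arguments)
        trajectory.append({"rationale": rationale, "api": api_name,
                           "arguments": arguments, "observation": observation})
    return "Unable to complete the request."  # step budget exhausted


# Stub functions standing in for the three fine-tuned models.
def stub_planner(instruction, trajectory):
    if not trajectory:
        return "I need the current weather.", "caller"
    return "I have what I need.", "conclusion"


def stub_caller(instruction, trajectory, rationale):
    return "get_weather", {"city": "Paris"}


def stub_summarizer(instruction, trajectory):
    return f"The weather is {trajectory[-1]['observation']}."


tools = {"get_weather": lambda city: "sunny"}

answer = run_agent("What is the weather in Paris?",
                   stub_planner, stub_caller, stub_summarizer, tools)
print(answer)
```

Only one model generates at each step, which is consistent with the page's note that splitting roles does not add extra generation at inference time.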

Optimization Features

Infra Optimization

  • use DeepSpeed ZeRO Stage 3 for fine-tuning

System Optimization

  • role-specific prompts to narrow model outputs

Training Optimization

  • two-stage GLPFT: global fine-tune shared backbone, then local role-specific fine-tune
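
One way to picture the two stages is through how the training data is partitioned. The Python sketch below only covers that data split, under the assumption that each training target can be tagged with the role that produced it; it does not run any fine-tuning, and the `Example` record and function names are hypothetical rather than taken from the paper's code.

```python
from dataclasses import dataclass

ROLES = ("planner", "caller", "summarizer")


@dataclass
class Example:
    context: str  # instruction plus the trajectory so far
    role: str     # which role produced the target
    target: str   # text the model should generate


def global_stage_data(examples):
    """Stage 1: the shared backbone sees every target, whatever its role."""
    return list(examples)


def local_stage_data(examples):
    """Stage 2: each clone of the backbone keeps only its own role's targets."""
    per_role = {role: [] for role in ROLES}
    for ex in examples:
        per_role[ex.role].append(ex)
    return per_role


# Hypothetical examples derived from a single tool-use trajectory.
examples = [
    Example("user: weather in Paris?", "planner", "Need weather. Next: caller"),
    Example("user: weather in Paris? ...", "caller", "get_weather(city=Paris)"),
    Example("... observation: sunny", "planner", "Enough info. Next: conclusion"),
    Example("... observation: sunny", "summarizer", "It is sunny in Paris."),
]

stage1 = global_stage_data(examples)
stage2 = local_stage_data(examples)
print(len(stage1), {role: len(items) for role, items in stage2.items()})
```

In this picture the three role models start from the same stage-1 weights and diverge only through the role-filtered data they see in stage 2.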

Inference Optimization

  • no extra generation at inference; sub-tasks distributed to keep latency similar

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires ~3× model storage if all role models use the same backbone size.
  • Training cost increases ~1.3×–1.5× (longer fine-tuning) compared to single-LLM.
  • Evaluations are on API-call benchmarks; behavior on other agent tasks may vary.
  • Orchestration and prompt engineering become operational responsibilities.

When Not To Use

  • When storage is extremely constrained (cannot host multiple model copies).
  • When you can afford a single much larger LLM that already meets your tool-use needs.
  • When tasks are trivial single-call workflows where a single model suffices.

Failure Modes

  • Caller still outputs malformed requests if prompts or examples are poor.
  • Broken or changing APIs can force recovery loops despite planner fallback.
  • Distribution shift in user instructions can bias one role if not covered in fine-tuning.
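
As a guard against the first failure mode, caller output can be checked before any API is invoked. The Python sketch below assumes, purely for illustration, that the caller emits a JSON object with `name` and `arguments` fields and that a registry lists the required arguments per API; the paper's actual output format may differ, and `validate_call` and `registry` are hypothetical names.

```python
import json


def validate_call(raw_output, registry):
    """Return (ok, message) for a caller output expected to be a JSON object."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in registry:
        return False, f"unknown API: {name!r}"
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return False, "arguments must be a JSON object"
    missing = [arg for arg in registry[name] if arg not in arguments]
    if missing:
        return False, f"missing arguments: {missing}"
    return True, "ok"


# Hypothetical registry: API name -> list of required argument names.
registry = {"get_weather": ["city"]}

good = validate_call('{"name": "get_weather", "arguments": {"city": "Paris"}}', registry)
unknown = validate_call('{"name": "get_wether", "arguments": {}}', registry)
incomplete = validate_call('{"name": "get_weather", "arguments": {}}', registry)
malformed = validate_call("get_weather(city=Paris)", registry)
print(good, unknown, incomplete, malformed, sep="\n")
```

A rejected call can be fed back to the planner as an observation, so a malformed request costs one extra step instead of a failed API invocation.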

Core Entities

Models

  • LLaMA-2-chat-7B
  • LLaMA-2-chat-13B
  • α-UMi (planner/caller/summarizer variants)

Metrics

  • Plan ACC
  • Action EM
  • Argument F1
  • Rouge-L
  • Hallucination rate
  • Proc. correctness (ToolAlpaca)
  • Ans. correctness (ToolAlpaca)
  • Pass rate / Win rate (real-time)

Datasets

  • ToolBench
  • ToolAlpaca
  • MATH
  • GSM8K

Benchmarks

  • ToolBench
  • ToolAlpaca
  • MATH
  • GSM8K

Context Entities

Models

  • ChatGPT (GPT-3.5)
  • GPT-4
  • Claude-2
  • ToolLLaMA