Split tool-use into planner, caller, summarizer so small LLMs handle APIs better

Overview

Decision Snapshot: Ready For Pilot

The approach is a pragmatic engineering trade: better correctness and lower hallucination for small LLMs at the cost of more storage and some extra training; inference latency stays similar.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 55%

Authors

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang


Why It Matters For Business

If you need reliable API/tool use but cannot afford large, expensive LLMs, splitting roles across small specialized models can raise correctness and cut hallucinations while keeping inference latency similar.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper shows small open-source LLMs struggle when one model must plan, generate API calls, and summarize results. The authors propose α-UMi: three small LLMs (planner, caller, summarizer) trained via a two-stage global-to-local progressive fine-tuning (GLPFT). On API-tool benchmarks (ToolBench, ToolAlpaca) α-UMi (7B) improves planning accuracy and API-call correctness vs single-LLM baselines and matches or beats larger single models while adding storage and some training cost. The method is practical when you can trade modest extra training/storage for more reliable tool use.

Problem Statement

Small open-source LLMs fail to simultaneously learn planning, safe API invocation, and answer summarization. Training one small model for all roles lowers quality and increases hallucinations. The paper asks: can we split roles across multiple small models and train them so tool use improves without needing a single large model?

Main Contribution

α-UMi: a multi-LLM agent that splits tool learning into planner, caller, summarizer.

GLPFT: a two-stage fine-tuning recipe (global fine-tune then local role-specific fine-tune).
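The two GLPFT stages can be pictured as two ways of slicing the same trajectory data. The sketch below is a minimal illustration, not the authors' code. The `Step` schema and the function names are hypothetical, and it assumes each step records a rationale plus an optional API call or final answer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (input text, target text)


@dataclass
class Step:
    """One step of a tool-use trajectory (hypothetical schema)."""
    context: str                   # prompt plus trajectory so far
    rationale: str                 # planner output for this step
    action: Optional[str] = None   # API call text, if the step calls a tool
    answer: Optional[str] = None   # final answer, if the step concludes


def global_examples(steps: List[Step]) -> List[Example]:
    """Stage 1 (global): the shared backbone learns the full output of each step."""
    examples = []
    for s in steps:
        parts = [s.rationale]
        if s.action:
            parts.append(s.action)
        if s.answer:
            parts.append(s.answer)
        examples.append((s.context, "\n".join(parts)))
    return examples


def local_examples(steps: List[Step]) -> Dict[str, List[Example]]:
    """Stage 2 (local): each role copy is trained only on its own targets."""
    roles: Dict[str, List[Example]] = {"planner": [], "caller": [], "summarizer": []}
    for s in steps:
        roles["planner"].append((s.context, s.rationale))
        with_rationale = s.context + "\n" + s.rationale
        if s.action:
            roles["caller"].append((with_rationale, s.action))
        if s.answer:
            roles["summarizer"].append((with_rationale, s.answer))
    return roles
```

In this framing the backbone fine-tuned on `global_examples` is cloned three times, and each clone continues training on its own entry of `local_examples`.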

Key Findings

Multi-LLM α-UMi improves planning and API-call metrics over a single-LLM baseline on ToolBench.

Numbers: Plan ACC 88.92 vs 81.92; Act. EM 58.94 vs 53.26 (7B, in-domain)

Practical Use: Use role decomposition to raise step-level planning accuracy and API-action correctness when model size is limited.

Evidence Ref: Table 1 (ToolBench, 7B, α-UMi w/ reuse vs Single-LLM)

α-UMi reduces API-name hallucinations notably.

Numbers: Hallu. 0.57 vs 2.32 (ToolBench in-domain, 7B)

Practical Use: Splitting out the caller role cuts hallucinated API calls, lowering the risk of broken calls and user confusion.

Evidence Ref: Table 1 (Hallu. column)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Plan ACC (in-domain, 7B)	88.92 (α-UMi w/ reuse)	81.92 (Single-LLM)	+7.0	ToolBench (in-domain)	Table 1: Plan ACC 88.92 vs 81.92	Table 1
Action EM (in-domain, 7B)	58.94 (α-UMi w/ reuse)	53.26 (Single-LLM)	+5.68	ToolBench (in-domain)	Table 1: Act. EM 58.94 vs 53.26	Table 1
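The deltas in the table can be re-derived with a few lines of arithmetic. The relative-change percentages below are computed here for convenience and are not reported in the source. The hallucination figures come from the Key Findings section.

```python
# (α-UMi w/ reuse, Single-LLM) values for ToolBench in-domain, 7B
reported = {
    "Plan ACC": (88.92, 81.92),
    "Action EM": (58.94, 53.26),
    "Hallu.": (0.57, 2.32),
}


def delta(new: float, base: float) -> float:
    """Absolute difference in points, rounded to two decimals."""
    return round(new - base, 2)


def relative_change(new: float, base: float) -> float:
    """Change relative to the baseline, in percent, rounded to one decimal."""
    return round(100 * (new - base) / base, 1)


for metric, (new, base) in reported.items():
    print(f"{metric}: delta={delta(new, base):+.2f}, relative={relative_change(new, base):+.1f}%")
```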

What To Try In 7 Days

Run a small pilot: fine-tune one shared 7B backbone then clone it into planner/caller/summarizer following GLPFT.

Measure step-level metrics (Plan ACC, Action EM, Hallu.) on a held-out set of your APIs.

Compare storage and training cost vs swapping to one larger model to verify cost-effectiveness for your workload.
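For the measurement step, a minimal scoring sketch might look like the following. The record schema and field names are hypothetical. The definitions are also assumptions: Plan ACC as agreement on the next-step decision, Action EM as exact match on the API name, and Hallu. as the share of predicted API names that are not in your API list. Check them against the benchmark's own scripts before comparing numbers.

```python
from typing import Dict, List, Set


def step_metrics(records: List[dict], known_apis: Set[str]) -> Dict[str, float]:
    """Score step-level predictions; all values are percentages."""
    n = len(records)
    plan_hits = sum(r["pred_decision"] == r["gold_decision"] for r in records)

    # Action EM is scored only on steps where the gold decision is an API call.
    call_steps = [r for r in records if r["gold_decision"] == "caller"]
    em_hits = sum(r.get("pred_api") == r["gold_api"] for r in call_steps)

    # A predicted API name counts as hallucinated if it is not a known API.
    predicted = [r["pred_api"] for r in records if r.get("pred_api")]
    hallucinated = sum(api not in known_apis for api in predicted)

    return {
        "plan_acc": 100 * plan_hits / n if n else 0.0,
        "action_em": 100 * em_hits / len(call_steps) if call_steps else 0.0,
        "hallu": 100 * hallucinated / len(predicted) if predicted else 0.0,
    }
```

Running the same function over the single-model baseline and the three-model setup gives directly comparable numbers for your own APIs.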

Agent Features

Memory

short execution trajectory passed between steps (τ_{t-1})

Planning

dedicated planner LLM

rationale generation

next-step decision (caller/conclusion/give up)

Tool Use

dedicated caller LLM for API calls

explicit API-name/argument formatting prompt

Frameworks

α-UMi multi-LLM agent

GLPFT (global-to-local progressive fine-tuning)

Is Agentic

Yes

Architectures

LLaMA-2-chat-7B, LLaMA-2-chat-13B

Collaboration

sequential planner -> caller -> (tools) -> planner loop

final summarizer composes user answer
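The control flow of this loop can be sketched as below. This is an illustration under assumptions, not the released implementation. The three role models are passed in as plain callables, the decision labels mirror the caller/conclusion/give up options listed under Planning, and the stub functions at the bottom are hypothetical stand-ins for fine-tuned models.

```python
from typing import Callable, Dict, List


def run_agent(query: str, planner: Callable, caller: Callable, summarizer: Callable,
              tools: Dict[str, Callable], max_steps: int = 8) -> str:
    """Planner -> caller -> tool -> planner loop, closed by the summarizer."""
    trajectory: List[dict] = []  # short execution history passed between steps
    for _ in range(max_steps):
        rationale, decision = planner(query, trajectory)
        if decision == "give_up":
            return "Unable to complete the request."
        if decision == "conclusion":
            return summarizer(query, trajectory, rationale)
        # decision == "caller": ask the caller model for an API name and arguments
        name, args = caller(query, trajectory, rationale)
        tool = tools.get(name)
        observation = tool(**args) if tool else f"error: unknown API '{name}'"
        trajectory.append({"rationale": rationale, "api": name,
                           "args": args, "observation": observation})
    return summarizer(query, trajectory, "step budget exhausted")


# Hypothetical stubs standing in for the three fine-tuned models.
def stub_planner(query, trajectory):
    if not trajectory:
        return "look up the weather", "caller"
    return "enough information", "conclusion"


def stub_caller(query, trajectory, rationale):
    return "get_weather", {"city": "Paris"}


def stub_summarizer(query, trajectory, rationale):
    return f"Answer: {trajectory[-1]['observation']}"


answer = run_agent("Weather in Paris?", stub_planner, stub_caller, stub_summarizer,
                   tools={"get_weather": lambda city: f"sunny in {city}"})
print(answer)  # Answer: sunny in Paris
```

Only one role model generates at each step, which is consistent with the claim that latency stays close to a single-model agent.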

Optimization Features

Infra Optimization

use DeepSpeed ZeRO Stage 3 for fine-tuning

System Optimization

role-specific prompts to narrow model outputs

Training Optimization

two-stage GLPFT: global fine-tune shared backbone, then local role-specific fine-tune

Inference Optimization

no extra generation at inference; sub-tasks distributed to keep latency similar

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/X-PLUG/Multi-LLM-Agent

Risks & Boundaries

Limitations

Requires ~3× model storage if all role models use the same backbone size.

Training cost increases ~1.3×–1.5× (longer fine-tuning) compared to single-LLM.
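A back-of-envelope estimate makes these two limits concrete. The sketch assumes 16-bit weights (2 bytes per parameter), nominal parameter counts, and a hypothetical 10-hour single-model fine-tuning run. Substitute your own figures.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB: 1e9 params x bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def multi_role_cost(params_billion: float, single_train_hours: float,
                    n_roles: int = 3, train_multiplier=(1.3, 1.5)) -> dict:
    """Storage for n_roles model copies and the training-time range from the text."""
    low, high = train_multiplier
    return {
        "single_storage_gb": weights_gb(params_billion),
        "multi_storage_gb": n_roles * weights_gb(params_billion),
        "train_hours_low": low * single_train_hours,
        "train_hours_high": high * single_train_hours,
    }


# Hypothetical baseline: a 10-hour fine-tuning run for one 7B model.
cost = multi_role_cost(params_billion=7, single_train_hours=10)
print(cost)
print("one 13B model:", weights_gb(13), "GB")
```

Under these assumptions three 7B copies need roughly 42 GB of weights, compared with roughly 26 GB for one 13B model, so the storage comparison against a larger single model is worth running for your own deployment.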

When Not To Use

When storage is extremely constrained (cannot host multiple model copies).

When you can afford a single much larger LLM that already meets your tool-use needs.

Failure Modes

Caller still outputs malformed requests if prompts or examples are poor.

Broken or changing APIs can force recovery loops despite planner fallback.
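One inexpensive guard against the first failure mode is to validate the caller's output before executing it. The sketch below assumes the caller emits a JSON object with `api` and `args` keys. That format, the function name, and the spec layout are all hypothetical, so adapt them to whatever your prompts actually request.

```python
import json
from typing import Dict, Set, Tuple


def validate_call(raw: str, api_specs: Dict[str, Set[str]]) -> Tuple[bool, str]:
    """Check a caller output against known APIs.

    api_specs maps each API name to the set of its required argument names.
    Returns (ok, reason).
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "call must be a JSON object"
    name = call.get("api")
    args = call.get("args", {})
    if not isinstance(name, str) or name not in api_specs:
        return False, f"unknown API: {name}"
    if not isinstance(args, dict):
        return False, "args must be a JSON object"
    missing = api_specs[name] - set(args)
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))
    return True, "ok"
```

A rejected call can be fed back to the planner as an observation instead of being sent to the API, which keeps malformed requests from reaching external services.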

Core Entities

Models

LLaMA-2-chat-7B, LLaMA-2-chat-13B, α-UMi (planner/caller/summarizer variants)

Metrics

Plan ACC, Action EM, Argument F1, ROUGE-L, Hallucination rate, Proc. correctness (ToolAlpaca), Ans. correctness (ToolAlpaca), Pass rate / Win rate (real-time)

Datasets

ToolBench, ToolAlpaca, MATH, GSM8K

Benchmarks

ToolBench, ToolAlpaca, MATH, GSM8K

Context Entities

Models

ChatGPT (GPT-3.5), GPT-4, Claude-2, ToolLLaMA

