Cut the wasted work: use a big model for intent, a small model for the call, and inject the fixed syntax.

February 14, 2026 · 7 min read

Overview

Decision Snapshot: Ready For Pilot

Strong empirical gains on a public benchmark (BFCL), with ablations that isolate each module. Code and datasets are public, but the hybrid setup needs GPU memory for both the large and the small model, and accuracy depends on retriever robustness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: Creative Commons Attribution 4.0 (paper); code repo license unspecified in text

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 75%

Authors

Weibin Liao, Jian-guang Lou, Haoyi Xiong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.

Summary TLDR

HyFunc speeds up LLM-driven API/function calls by splitting the work: a large LLM produces one semantic 'soft token' for intent, a lightweight MLP retriever picks candidate functions, a small LLM generates parameter values guided by that soft token, and fixed syntax is injected at inference. Result: much lower latency (0.828 s per case, versus 1.984 s for ToolACE-8B) while keeping high accuracy (80.1% on BFCL).
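
The hand-off from the large model to the small one can be pictured as a projection followed by a prepend. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions, the single linear projector, the random weights, and every function name are hypothetical stand-ins, and a real system would take the hidden state from an actual large-model forward pass.

```python
import random

def linear(vec, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def make_projector(d_large, d_small, seed=0):
    """Hypothetical projector: one untrained linear map from large- to small-model space."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_large)] for _ in range(d_small)]
    bias = [0.0] * d_small
    return weight, bias

def prepend_soft_token(soft_token, token_embeddings):
    """Put the soft token in front of the small model's ordinary input embeddings."""
    return [soft_token] + token_embeddings

D_LARGE, D_SMALL = 16, 8  # toy sizes chosen for the example only
rng = random.Random(1)

# Stand-in for the large model's hidden state after its single forward pass.
large_hidden = [rng.gauss(0.0, 1.0) for _ in range(D_LARGE)]

weight, bias = make_projector(D_LARGE, D_SMALL)
soft_token = linear(large_hidden, weight, bias)

# Stand-in for the small model's embeddings of a 5-token prompt.
prompt_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(D_SMALL)] for _ in range(5)]
small_model_input = prepend_soft_token(soft_token, prompt_embeddings)

print(len(small_model_input), len(small_model_input[0]))  # 6 8
```

The point of the design is that the large model runs once and contributes a single vector, while all token-by-token decoding happens in the small model.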

Problem Statement

Converting free-text user intent to executable function calls is slow because large models repeatedly parse long tool libraries, generate predictable boilerplate tokens, and produce full call sequences instead of just the needed values.

Main Contribution

Design HyFunc, a hybrid cascade that uses a single forward pass of a large LLM to produce a 'first soft token' and hands off generation to a smaller model.

Introduce an MLP-based dual-encoder retriever trained on soft-token/function embeddings to avoid re-processing the full function library every query.
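
A dual-encoder retriever of this kind can be sketched as two small MLP towers and a similarity ranking. The code below is an assumption-laden toy, not the paper's retriever: the weights are random rather than trained, cosine similarity is an assumed scoring choice, and the function names, layer sizes, and library entries are all hypothetical. What it does show is the part that saves work: the function library is encoded once into an index, and each query only encodes the soft token.

```python
import math
import random

def init_mlp(sizes, rng):
    """Random weights for an MLP with the given layer sizes (untrained, illustration only)."""
    layers = []
    for d_in, d_out in zip(sizes, sizes[1:]):
        weight = [[rng.gauss(0.0, 0.5) for _ in range(d_in)] for _ in range(d_out)]
        layers.append((weight, [0.0] * d_out))
    return layers

def mlp(vec, layers):
    """Forward pass with ReLU after every layer except the last."""
    for i, (weight, bias) in enumerate(layers):
        vec = [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        if i < len(layers) - 1:
            vec = [max(0.0, v) for v in vec]
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(function_embeddings, function_tower):
    """Encode the whole function library once, offline."""
    return {name: mlp(emb, function_tower) for name, emb in function_embeddings.items()}

def retrieve(soft_token, query_tower, index, k):
    """Encode the query and return the k most similar function names."""
    query = mlp(soft_token, query_tower)
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

rng = random.Random(0)
DIM = 8  # toy embedding size
names = ["get_weather", "send_email", "book_flight", "get_time"]  # hypothetical library
library = {name: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for name in names}

query_tower = init_mlp([DIM, 16, 4], rng)
function_tower = init_mlp([DIM, 16, 4], rng)
index = build_index(library, function_tower)  # done once, reused for every query

soft_token = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
candidates = retrieve(soft_token, query_tower, index, k=2)
print(candidates)
```

Only the `k` retrieved function descriptions then need to reach the small model, rather than the full library.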

Key Findings

HyFunc reduces end-to-end inference latency to 0.828 seconds per case.

Numbers: Latency = 0.828 s (HyFunc ♣, Table 1)

Practical Use: You can use HyFunc to get sub-second function-call latency on commodity GPUs for real-time agents.

Evidence Ref: Table 1, Sec. 3.2.1

HyFunc reaches 80.1% overall accuracy on the BFCL benchmark using a 0.6B small model.

Numbers: Accuracy = 80.1% (BFCL, Table 2)

Practical Use: Small models tuned via HyFunc can match or exceed larger models’ function-calling accuracy while running faster.

Evidence Ref: Table 2, Sec. 3.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| End-to-end inference latency | 0.828 s | ToolACE-8B full model: 1.984 s | Faster than all baselines (Table 1) | BFCL (inference measurement) | Table 1: HyFunc total time 0.828 s; ToolACE-8B 1.984 s | Table 1 |
| Accuracy | 80.1% | Qwen3-0.6B base: 62.2% | +17.9 percentage points vs. base (0.6B) | BFCL leaderboard (Out-of-Domain) | Table 2: HyFunc ♣ overall 80.1% vs. Qwen3-0.6B 62.2% | Table 2 |
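
The deltas follow directly from the reported figures. As a quick check of the arithmetic (the variable names are ours; the four numbers are the ones quoted from Table 1 and Table 2, and note that the latency and accuracy baselines are different models):

```python
# Reported figures from the results above.
hyfunc_latency_s = 0.828
toolace_latency_s = 1.984   # ToolACE-8B full model
hyfunc_accuracy = 80.1      # percent, BFCL overall
base_accuracy = 62.2        # percent, Qwen3-0.6B base

speedup = toolace_latency_s / hyfunc_latency_s
reduction_pct = (1 - hyfunc_latency_s / toolace_latency_s) * 100
accuracy_delta = hyfunc_accuracy - base_accuracy

print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% less latency, {accuracy_delta:+.1f} pp accuracy")
# 2.40x faster, 58.3% less latency, +17.9 pp accuracy
```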

What To Try In 7 Days

Run a small pilot: add a lightweight retriever + soft-token projector to your existing function-calling pipeline and measure latency on a held-out dataset.

Enable Dynamic Templating for a single API: mask syntax tokens and inject the template while letting the model generate only parameter values.

A/B test HyFunc-tuned small model vs your current model on latency and end-to-end correctness (use BFCL or your logs).
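
For the Dynamic Templating step, the idea can be prototyped without touching a model: the pipeline writes the fixed syntax itself and asks the model only for each parameter value. This is a rough sketch under our own assumptions. `render_call`, `fake_small_model`, the `get_weather` function, and its parameters are hypothetical, and the stub returns canned strings where a real pilot would call the small model's decoder.

```python
def render_call(function_name, param_names, generate_value):
    """Build a call string: syntax is injected, values come from generate_value."""
    parts = [function_name, "("]              # injected, never decoded
    for i, name in enumerate(param_names):
        if i:
            parts.append(", ")                # injected separator
        parts.append(f"{name}=")              # injected parameter name
        parts.append(generate_value(name))    # the only model-generated piece
    parts.append(")")                         # injected closing syntax
    return "".join(parts)

def fake_small_model(param_name):
    """Stand-in for the small model; returns canned values for the demo."""
    canned = {"city": '"Paris"', "unit": '"celsius"'}
    return canned[param_name]

call = render_call("get_weather", ["city", "unit"], fake_small_model)
print(call)  # get_weather(city="Paris", unit="celsius")

# Rough view of how much of the output the model did not have to produce.
generated_chars = sum(len(fake_small_model(p)) for p in ["city", "unit"])
print(f"{len(call) - generated_chars} of {len(call)} characters were injected")
```

Comparing latency with and without the injected template on a single API is a cheap way to see whether the saving carries over to your own schemas.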

Agent Features

Memory: KV cache for reuse across mode switches

Planning: function selection (short-horizon)

Tool Use: function calling, template injection

Frameworks: vLLM, PyTorch, HuggingFace Transformers

Is Agentic: Yes

Architectures: hybrid-model cascade, dual-encoder MLP retriever

Optimization Features

Token Efficiency: generate only parameter values, skip boilerplate via template injection

Infra Optimization: vLLM integration, small-model serving for the majority of compute

Model Optimization: model cascade (LML → LMS, i.e. large model to small model), projector to map embeddings

System Optimization: split the work into a short LML forward pass and a longer LMS generation to reduce latency

Training Optimization: SFT, prefix continuous prompting (soft-token conditioning)

Inference Optimization: Dynamic Templating (inject fixed syntax), MLP retriever for fast candidate pruning, KV cache reuse across mode switches

Risks & Boundaries

Limitations

Sensitive to retriever quality: wrong retrievals cascade into wrong calls (Sec. 5).

Currently single-turn only: not evaluated for multi-turn clarification or active questioning (Sec. 5).

When Not To Use

When function calling is unnecessary (HyFunc forces argument generation and may add decoding overhead).

In highly multi-turn interactive agents without further extension to handle clarifying questions.

Failure Modes

Retriever returns wrong or missing candidates → LMS generates incorrect function or parameters.

Projector misalignment between LML and LMS embedding spaces causes poor conditioning and wrong values.
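
Because retrieval errors propagate to the final call, it is worth measuring the retriever in isolation before judging end-to-end accuracy. One common way to do that is recall@k over logged queries with known correct functions. The sketch below uses made-up data and a hypothetical helper name; it is not taken from the paper's evaluation.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of queries whose correct function appears in the top-k candidates."""
    hits = sum(1 for candidates, target in zip(retrieved, gold) if target in candidates[:k])
    return hits / len(gold)

# Hypothetical retriever outputs (ranked) and the correct function for each query.
retrieved = [
    ["get_weather", "get_time"],
    ["send_email", "get_time"],
    ["get_time", "book_flight"],
]
gold = ["get_weather", "get_time", "book_flight"]

print(recall_at_k(retrieved, gold, k=1))  # 0.3333333333333333
print(recall_at_k(retrieved, gold, k=2))  # 1.0
```

A low recall@k puts a hard ceiling on call accuracy, since the small model cannot select a function it never sees.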

Core Entities

Models

ToolACE-8B, Qwen3-0.6B, Qwen2.5-0.5B-Instruct, Qwen3 series, Qwen2.5 series, Hammer series, xLAM series, Granite-20B, GPT-4o

Metrics

Latency (s), Accuracy, Token count reduction (%)

Datasets

BFCL (Berkeley Function-Calling Leaderboard), Salesforce/xlam-function-calling-60k (offline prep)

Benchmarks

BFCL