Cut the wasted work: use a big model for intent, a small model for the call, and inject the fixed syntax.

Overview

Decision Snapshot: Ready For Pilot

Strong empirical gains on a public benchmark (BFCL), with ablations that isolate each module. Code and datasets are public, but the approach adds GPU memory overhead from hosting both the large and small models, and it depends on retriever robustness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: Creative Commons Attribution 4.0 (paper); code repo license unspecified in text

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 75%

Authors

Weibin Liao, Jian-guang Lou, Haoyi Xiong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.

Who Should Care

CTO, Product Manager, ML Engineer

Summary TLDR

HyFunc speeds up LLM-driven API/function calls by splitting work: a large LLM produces one semantic 'soft token' for intent, a tiny retriever picks candidate functions, a small LLM generates parameter values guided by that soft token, and fixed syntax is injected at inference. Result: much lower latency (0.828s) while keeping high accuracy (80.1% on BFCL).
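
The division of labor described above can be sketched as a four-stage pipeline. This is a minimal illustration with stubbed components, not the authors' implementation: every function name, the toy function library, the keyword-overlap scoring, and the canned parameter values are hypothetical placeholders standing in for the real models.

```python
from typing import Dict, List

# Hypothetical toy function library: function name -> parameter names.
FUNCTION_LIBRARY: Dict[str, List[str]] = {
    "get_weather": ["city", "unit"],
    "book_flight": ["origin", "destination"],
}


def large_model_soft_token(query: str) -> List[str]:
    """Stub for the single large-model forward pass.

    A real system would return a continuous embedding; lowercase words
    are used here as a stand-in for the intent representation.
    """
    return query.lower().replace(",", " ").split()


def retrieve_candidates(soft_token: List[str], top_k: int = 1) -> List[str]:
    """Stub retriever: rank functions by keyword overlap with the intent."""
    def score(name: str) -> int:
        return sum(part in soft_token for part in name.split("_"))

    ranked = sorted(FUNCTION_LIBRARY, key=score, reverse=True)
    return ranked[:top_k]


def small_model_fill_values(query: str, function: str) -> Dict[str, str]:
    """Stub for the small model, which only has to produce parameter values."""
    canned = {"get_weather": {"city": "Paris", "unit": "celsius"}}
    return canned.get(function, {p: "" for p in FUNCTION_LIBRARY[function]})


def inject_template(function: str, values: Dict[str, str]) -> str:
    """Fixed call syntax is inserted by code rather than generated."""
    args = ", ".join(f'{p}="{values[p]}"' for p in FUNCTION_LIBRARY[function])
    return f"{function}({args})"


def hybrid_call(query: str) -> str:
    soft_token = large_model_soft_token(query)         # large model: intent
    function = retrieve_candidates(soft_token)[0]      # retriever: candidate
    values = small_model_fill_values(query, function)  # small model: values
    return inject_template(function, values)           # template: syntax


print(hybrid_call("What is the weather in Paris?"))
```

The point of the sketch is the hand-off order: the expensive model is consulted once, and everything after it is either cheap or deterministic.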

Problem Statement

Converting free-text user intent to executable function calls is slow because large models repeatedly parse long tool libraries, generate predictable boilerplate tokens, and produce full call sequences instead of just the needed values.

Main Contribution

Design HyFunc, a hybrid cascade that uses a single forward pass of a large LLM to produce a 'first soft token' and hands off generation to a smaller model.

Introduce an MLP-based dual-encoder retriever trained on soft-token/function embeddings to avoid re-processing the full function library every query.
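
A dual-encoder retriever of this kind might look roughly like the sketch below. The summary only states that the retriever is MLP-based and trained on soft-token/function embeddings; the layer sizes, the use of cosine similarity, the random untrained weights, and all names (`query_tower`, `function_tower`, `function_index`, `retrieve`) are assumptions made for illustration. The part that matters is that function embeddings are computed once offline, so each query costs one small MLP pass plus a similarity lookup.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def make_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random weights; a real retriever would load trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mlp(layers: List[Matrix], x: Vector) -> Vector:
    """Apply linear layers with ReLU between them (none after the last)."""
    for i, weights in enumerate(layers):
        x = linear(weights, x)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(sum(v * v for v in a)) * math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


rng = random.Random(0)
SOFT_DIM, FUNC_DIM, HIDDEN, SHARED = 8, 6, 16, 4  # toy sizes (assumed)

# Two towers that map into one shared space.
query_tower = [make_matrix(HIDDEN, SOFT_DIM, rng), make_matrix(SHARED, HIDDEN, rng)]
function_tower = [make_matrix(HIDDEN, FUNC_DIM, rng), make_matrix(SHARED, HIDDEN, rng)]

# Offline step: encode every function once and cache the result.
raw_function_embeddings: Dict[str, Vector] = {
    name: [rng.uniform(-1.0, 1.0) for _ in range(FUNC_DIM)]
    for name in ["get_weather", "book_flight", "send_email"]
}
function_index: Dict[str, Vector] = {
    name: mlp(function_tower, emb) for name, emb in raw_function_embeddings.items()
}


def retrieve(soft_token: Vector, top_k: int = 2) -> List[Tuple[str, float]]:
    """Online step: encode the soft token, then rank cached functions."""
    q = mlp(query_tower, soft_token)
    scored = [(name, cosine(q, f)) for name, f in function_index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


soft_token = [rng.uniform(-1.0, 1.0) for _ in range(SOFT_DIM)]
for name, score in retrieve(soft_token):
    print(f"{name}: {score:.3f}")
```

With untrained weights the ranking is meaningless; the sketch only shows the data flow and where the offline cache sits.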

Key Findings

HyFunc reduces end-to-end inference latency to 0.828 seconds per case.

Numbers: Latency = 0.828 s (HyFunc ♣, Table 1)

Practical Use: You can use HyFunc to get sub-second function-call latency on commodity GPUs for real-time agents.

Evidence Ref: Table 1, Sec. 3.2.1

HyFunc reaches 80.1% overall accuracy on the BFCL benchmark using a 0.6B small model.

Numbers: Accuracy = 80.1% (BFCL, Table 2)

Practical Use: Small models tuned via HyFunc can match or exceed larger models’ function-calling accuracy while running faster.

Evidence Ref: Table 2, Sec. 3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
End-to-end inference latency	0.828 s	ToolACE-8B full model 1.984 s	-1.156 s (about 58% lower) vs ToolACE-8B; faster than all baselines (Table 1)	BFCL (inference measurement)	Table 1: HyFunc total time 0.828 s; ToolACE-8B 1.984 s	Table 1
Accuracy	80.1%	Qwen3-0.6B base 62.2%	+17.9 percentage points vs base (0.6B)	BFCL leaderboard (Out-of-Domain)	Table 2: HyFunc ♣ overall 80.1% vs Qwen3-0.6B 62.2%	Table 2
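
The deltas in the two rows above follow directly from the reported values. The short calculation below uses only the four numbers in the table; the variable names are illustrative.

```python
# Values as reported in the Results table.
hyfunc_latency_s = 0.828
toolace_latency_s = 1.984
hyfunc_accuracy = 80.1
base_accuracy = 62.2

latency_saved_s = toolace_latency_s - hyfunc_latency_s
relative_reduction = latency_saved_s / toolace_latency_s
speedup = toolace_latency_s / hyfunc_latency_s
accuracy_gain_pp = hyfunc_accuracy - base_accuracy

print(f"Latency saved: {latency_saved_s:.3f} s")        # 1.156 s
print(f"Relative reduction: {relative_reduction:.1%}")  # 58.3%
print(f"Speedup: {speedup:.2f}x")                       # 2.40x
print(f"Accuracy gain: {accuracy_gain_pp:.1f} pp")      # 17.9 pp
```

Note that the two comparisons use different baselines: the latency figure is against ToolACE-8B, while the accuracy gain is against the untuned Qwen3-0.6B base model.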

What To Try In 7 Days

Run a small pilot: add a lightweight retriever + soft-token projector to your existing function-calling pipeline and measure latency on a held-out dataset.

Enable Dynamic Templating for a single API: mask syntax tokens and inject the template while letting the model generate only parameter values.

A/B test HyFunc-tuned small model vs your current model on latency and end-to-end correctness (use BFCL or your logs).
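
For the A/B comparison in the last step, a small harness along these lines may be enough to start with. It is a sketch under stated assumptions: `current_pipeline` and `candidate_pipeline` are hypothetical stand-ins for your two systems (here they just sleep briefly and return a fixed call string), and exact string match is only one possible definition of end-to-end correctness.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    call: Callable[[str], str], queries: List[str], warmup: int = 1
) -> Dict[str, float]:
    """Time each query with a monotonic clock after a short warm-up."""
    for query in queries[:warmup]:
        call(query)
    timings = []
    for query in queries:
        start = time.perf_counter()
        call(query)
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def exact_match_accuracy(
    call: Callable[[str], str], queries: List[str], expected: List[str]
) -> float:
    """Fraction of queries whose generated call equals the reference call."""
    hits = sum(call(q) == e for q, e in zip(queries, expected))
    return hits / len(queries)


# Hypothetical stand-ins for the two pipelines under test.
def current_pipeline(query: str) -> str:
    time.sleep(0.002)
    return f'search(query="{query}")'


def candidate_pipeline(query: str) -> str:
    time.sleep(0.001)
    return f'search(query="{query}")'


queries = ["cheap flights to Oslo", "weather in Paris", "nearest pharmacy"]
expected = [f'search(query="{q}")' for q in queries]

for label, pipeline in [("current", current_pipeline), ("candidate", candidate_pipeline)]:
    stats = measure_latency(pipeline, queries)
    accuracy = exact_match_accuracy(pipeline, queries, expected)
    print(f"{label}: mean={stats['mean_s']:.4f}s accuracy={accuracy:.0%}")
```

Run both pipelines on the same held-out queries and the same hardware, and report latency alongside correctness so a speedup that costs accuracy is visible.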

Agent Features

Memory

KV cache for reuse across mode switches

Planning

function selection (short-horizon)

Tool Use

function calling, template injection

Frameworks

vLLM, PyTorch, HuggingFace Transformers

Is Agentic

Yes

Architectures

hybrid-model cascade, dual-encoder MLP retriever

Optimization Features

Token Efficiency

generate only parameter values, skip boilerplate via template injection
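
The token saving from template injection can be illustrated with a toy example. This is a sketch, not the paper's Dynamic Templating code: the regex tokenizer is a crude stand-in for a real model tokenizer, `generate_value` is a hypothetical callback representing the small model, and the call syntax is an assumed format.

```python
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Crude tokenizer stand-in: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def render_call(
    name: str, params: List[str], generate_value: Callable[[str], str]
) -> Dict[str, object]:
    """Build a call string where only parameter values come from the model."""
    pieces = [f"{name}("]                       # injected syntax
    generated_tokens = 0
    for i, param in enumerate(params):
        value = generate_value(param)           # the only "decoded" part
        generated_tokens += len(tokenize(value))
        separator = ", " if i < len(params) - 1 else ""
        pieces.append(f'{param}="{value}"{separator}')  # injected syntax
    pieces.append(")")
    call = "".join(pieces)
    return {
        "call": call,
        "generated_tokens": generated_tokens,
        "total_tokens": len(tokenize(call)),
    }


# A dictionary lookup stands in for the small model's value generation.
values = {"city": "Paris", "unit": "celsius"}
result = render_call("get_weather", ["city", "unit"], values.__getitem__)

print(result["call"])
print(result["generated_tokens"], "of", result["total_tokens"], "tokens generated")
```

Under this toy tokenizer, most tokens in the final call are fixed syntax, which is why skipping them reduces decoding steps; actual savings depend on the real tokenizer and call format.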

Infra Optimization

vLLM integration, small-model serving for majority of compute

Model Optimization

model cascade (LML → LMS, large model to small model), projector to map embeddings

System Optimization

split LML short forward pass and longer LMS generation to reduce latency

Training Optimization

SFT, prefix continuous prompting (soft-token conditioning)

Inference Optimization

Dynamic Templating (inject fixed syntax), MLP retriever for fast candidate pruning, KV cache reuse across mode switches

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Creative Commons Attribution 4.0 (paper); code repo license unspecified in text

Code URLs

https://github.com/MrBlankness/HyFunc
https://doi.org/10.5281/zenodo.18137443

Data URLs

https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k
https://gorilla.cs.berkeley.edu/leaderboard.html (BFCL)

Risks & Boundaries

Limitations

Sensitive to retriever quality: wrong retrievals cascade into wrong calls (Sec. 5).

Currently single-turn only: not evaluated for multi-turn clarification or active questioning (Sec. 5).

When Not To Use

When function calling is unnecessary (HyFunc forces argument generation and may add decoding overhead).

In highly multi-turn interactive agents without further extension to handle clarifying questions.

Failure Modes

Retriever returns wrong or missing candidates → LMS generates incorrect function or parameters.

Projector misalignment between LML and LMS embedding spaces causes poor conditioning and wrong values.

Core Entities

Models

ToolACE-8B, Qwen3-0.6B, Qwen2.5-0.5B-Instruct, Qwen3 series, Qwen2.5 series, Hammer series, xLAM series, Granite-20B, GPT-4o

Metrics

Latency (s), Accuracy, Token count reduction (%)

Datasets

BFCL (Berkeley Function Calling Leaderboard), Salesforce/xlam-function-calling-60k (offline prep)

Benchmarks

BFCL

