Cut the wasted work: use a big model for intent, a small model for the call, and inject the fixed syntax.

February 14, 2026 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.75

Cost Impact Score

0.8

Citation Count

0

Authors

Weibin Liao, Jian-guang Lou, Haoyi Xiong

Why It Matters For Business

HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.

Summary TLDR

HyFunc speeds up LLM-driven API/function calls by splitting work: a large LLM produces one semantic 'soft token' for intent, a tiny retriever picks candidate functions, a small LLM generates parameter values guided by that soft token, and fixed syntax is injected at inference. Result: much lower latency (0.828s) while keeping high accuracy (80.1% on BFCL).

Problem Statement

Converting free-text user intent to executable function calls is slow because large models repeatedly parse long tool libraries, generate predictable boilerplate tokens, and produce full call sequences instead of just the needed values.

Main Contribution

Design HyFunc, a hybrid cascade that uses a single forward pass of a large LLM to produce a 'first soft token' and hands off generation to a smaller model.

Introduce an MLP-based dual-encoder retriever trained on soft-token/function embeddings to avoid re-processing the full function library every query.

Propose Selective SFT and Dynamic Templating: train only on parameter value tokens and inject fixed syntax during inference to skip boilerplate generation.
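The Selective SFT idea above (compute the loss only on parameter value tokens) can be sketched as a label mask. This is a minimal illustration, not the authors' training code: the token ids are made up, the choice of which tokens count as "values" is an assumption, and `IGNORE_INDEX = -100` simply follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def selective_labels(token_ids, is_value_token):
    """Keep a label only where the token is a parameter value; mask the rest."""
    if len(token_ids) != len(is_value_token):
        raise ValueError("token_ids and is_value_token must have equal length")
    return [tid if keep else IGNORE_INDEX
            for tid, keep in zip(token_ids, is_value_token)]


# Toy target sequence for the call get_weather(city="Paris").
tokens = ["get_weather", "(", "city", "=", '"', "Paris", '"', ")"]
token_ids = list(range(101, 101 + len(tokens)))  # hypothetical ids 101..108
# Assumption: only the value itself is supervised; quotes count as syntax.
is_value = [tok == "Paris" for tok in tokens]

labels = selective_labels(token_ids, is_value)
print(labels)  # [-100, -100, -100, -100, -100, 106, -100, -100]
```

With labels built this way, syntax positions contribute nothing to the loss, so the small model is never trained to produce the boilerplate that templating supplies at inference.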

Key Findings

HyFunc reduces end-to-end inference latency to 0.828 seconds per case.

Numbers: Latency = 0.828 s (HyFunc ♣, Table 1)

HyFunc reaches 80.1% overall accuracy on the BFCL benchmark using a 0.6B small model.

Numbers: Accuracy = 80.1% (BFCL, Table 2)

The hybrid design cuts output tokens by 29.54% relative to the base small model, and the ablations show that each component adds a measurable gain.

Numbers: 29.54% fewer output tokens; soft-token distillation +6.6%, selective tuning +2.7%, dynamic templating +6.2% (Tables 4 and 3)

Results

End-to-end inference latency

Value: 0.828 s

Baseline: ToolACE-8B full model, 1.984 s

Accuracy

Value: 80.1%

Baseline: Qwen3-0.6B base, 62.2%

Function Retriever Exact Match (Execute)

Value: 95.8% EMAcc

Baseline: not stated

Output token reduction

Value: 29.54% fewer output tokens

Baseline: Qwen3-0.6B
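To put the reported figures side by side, the short calculation below derives the relative speedup and the accuracy gain from the values listed above. The latency and accuracy baselines are different models (ToolACE-8B and Qwen3-0.6B base, respectively), so the two ratios should not be read as a single head-to-head comparison.

```python
# Figures as reported in the Results section above.
hyfunc_latency_s = 0.828
toolace_latency_s = 1.984   # latency baseline: ToolACE-8B full model
hyfunc_acc_pct = 80.1
qwen_base_acc_pct = 62.2    # accuracy baseline: Qwen3-0.6B base

speedup = toolace_latency_s / hyfunc_latency_s
latency_cut_pct = (1 - hyfunc_latency_s / toolace_latency_s) * 100
acc_gain_points = hyfunc_acc_pct - qwen_base_acc_pct

print(f"speedup vs ToolACE-8B: {speedup:.2f}x")            # 2.40x
print(f"latency reduction:     {latency_cut_pct:.1f}%")    # 58.3%
print(f"accuracy gain:         {acc_gain_points:.1f} pts") # 17.9 pts
```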

Who Should Care

What To Try In 7 Days

Run a small pilot: add a lightweight retriever + soft-token projector to your existing function-calling pipeline and measure latency on a held-out dataset.

Enable Dynamic Templating for a single API: mask syntax tokens and inject the template while letting the model generate only parameter values.

A/B test HyFunc-tuned small model vs your current model on latency and end-to-end correctness (use BFCL or your logs).
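For the pilot and A/B steps above, a small timing harness is usually enough to get a first latency and correctness reading. The sketch below is one possible shape, not part of HyFunc: `baseline_fn` and `candidate_fn` are hypothetical stand-ins for your current pipeline and the HyFunc-style one, and the stubs here only sleep to simulate work.

```python
import statistics
import time


def measure_latency(call_fn, queries, warmup=1):
    """Time call_fn on each query after a short warmup; return summary stats."""
    for q in queries[:warmup]:
        call_fn(q)
    samples = []
    for q in queries:
        start = time.perf_counter()
        call_fn(q)
        samples.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(samples),
            "median_s": statistics.median(samples),
            "n": len(samples)}


def exact_match_rate(predictions, references):
    """Fraction of predicted calls that equal the reference call string."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical stand-ins: replace with real pipeline calls.
def baseline_fn(query):
    time.sleep(0.02)
    return f"lookup(q={query!r})"


def candidate_fn(query):
    time.sleep(0.002)
    return f"lookup(q={query!r})"


queries = ["weather in Paris", "book a flight", "send an email"]
base = measure_latency(baseline_fn, queries)
cand = measure_latency(candidate_fn, queries)
speedup = base["mean_s"] / cand["mean_s"]
agreement = exact_match_rate([candidate_fn(q) for q in queries],
                             [baseline_fn(q) for q in queries])
print(f"speedup: {speedup:.1f}x, agreement with baseline: {agreement:.0%}")
```

In a real run, compare against gold calls from BFCL or your own logs instead of baseline outputs, and use enough queries for the median to be stable.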

Agent Features

Memory

  • KV cache for reuse across mode switches

Planning

  • function selection (short-horizon)

Tool Use

  • function calling
  • template injection

Frameworks

  • vLLM
  • PyTorch
  • HuggingFace Transformers

Is Agentic

true

Architectures

  • hybrid-model cascade
  • dual-encoder MLP retriever
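A dual-encoder retriever of the kind listed above can be sketched in a few lines: one tower encodes the query-side soft token, the other encodes each function, and candidates are ranked by similarity. Everything concrete here is an assumption for illustration (the 3-dimensional vectors, the identity weights, the single linear+ReLU layer standing in for each MLP, and the function names); the point is only that function embeddings are computed once and reused across queries.

```python
import math


def mlp(x, weights, bias):
    """One linear layer + ReLU; a stand-in for each encoder tower."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, function_index, k=2):
    """Rank precomputed function embeddings by cosine similarity."""
    scored = sorted(((cosine(query_vec, vec), name)
                     for name, vec in function_index.items()), reverse=True)
    return [name for _, name in scored[:k]]


# Toy weights (identity, zero bias) shared by both towers for simplicity.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [0.0, 0.0, 0.0]

# Built once, offline: hypothetical raw function features -> embeddings.
raw_functions = {
    "get_weather": [0.9, 0.2, 0.0],
    "book_flight": [0.0, 1.0, 0.1],
    "send_email": [0.0, 0.1, 1.0],
}
function_index = {name: mlp(vec, W, B) for name, vec in raw_functions.items()}

# Per query: encode the soft token and retrieve candidates.
soft_token = [1.0, 0.1, 0.0]  # made-up vector standing in for the large model's output
candidates = top_k(mlp(soft_token, W, B), function_index, k=2)
print(candidates)  # ['get_weather', 'book_flight']
```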

Optimization Features

Token Efficiency

  • generate only parameter values
  • skip boilerplate via template injection
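The two token-efficiency items above amount to writing the call syntax from a template and asking the model only for the values. The sketch below illustrates that split under stated assumptions: `generate_value` is a hypothetical hook for the small model's decoding step (stubbed here with canned answers), the call format is invented for the example, and character counts are used as a rough proxy for tokens.

```python
def render_call(func_name, param_names, generate_value):
    """Inject fixed call syntax; request only each parameter value from the model."""
    parts = []
    generated_chars = 0
    for name in param_names:
        value = generate_value(func_name, name)  # the only "decoded" text
        generated_chars += len(value)
        parts.append(f"{name}={value}")          # name, '=', ', ' are injected
    call = f"{func_name}({', '.join(parts)})"
    return call, generated_chars


def stub_small_model(func_name, param_name):
    """Hypothetical stand-in for the small model; returns canned values."""
    canned = {"city": '"Paris"', "unit": '"celsius"'}
    return canned[param_name]


call, generated = render_call("get_weather", ["city", "unit"], stub_small_model)
print(call)                                   # get_weather(city="Paris", unit="celsius")
print(f"{generated} of {len(call)} characters were generated")  # 16 of 41
```

Note that this scheme always fills in values, which mirrors the failure mode listed under Risks & Boundaries: if no function should be called, the template still forces parameters to be produced.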

Infra Optimization

  • vLLM integration
  • small-model serving for majority of compute

Model Optimization

  • model cascade (LML → LMS, i.e. large LLM → small LLM)
  • projector to map embeddings

System Optimization

  • split work into a short forward pass on the large model (LML) and longer generation on the small model (LMS) to reduce latency

Training Optimization

  • SFT
  • prefix continuous prompting (soft-token conditioning)

Inference Optimization

  • Dynamic Templating (inject fixed syntax)
  • MLP retriever for fast candidate pruning
  • KV Cache reuse across mode switches

Reproducibility

License

  • Creative Commons Attribution 4.0 (paper); code repo license unspecified in text

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Sensitive to retriever quality: wrong retrievals cascade into wrong calls (Sec. 5).
  • Currently single-turn only: not evaluated for multi-turn clarification or active questioning (Sec. 5).
  • Higher peak GPU memory due to hybrid models (reported max 18.1G) even if average usage is moderate (Table 7).

When Not To Use

  • When function calling is unnecessary (HyFunc forces argument generation and may add decoding overhead).
  • In highly multi-turn interactive agents without further extension to handle clarifying questions.
  • On very low-memory GPUs where the hybrid model peak memory (≈18G) is unaffordable.

Failure Modes

  • Retriever returns wrong or missing candidates → LMS generates incorrect function or parameters.
  • Projector misalignment between LML and LMS embedding spaces causes poor conditioning and wrong values.
  • Dynamic templating forces generation of values even when no function should be called, producing meaningless parameters.

Core Entities

Models

  • ToolACE-8B
  • Qwen3-0.6B
  • Qwen2.5-0.5B-Instruct
  • Qwen3 series
  • Qwen2.5 series
  • Hammer series
  • xLAM series
  • Granite-20B
  • GPT-4o

Metrics

  • Latency (s)
  • Accuracy
  • Token count reduction (%)

Datasets

  • BFCL (Berkeley Function Call Leaderboard)
  • Salesforce/xlam-function-calling-60k (offline prep)

Benchmarks

  • BFCL