Overview
Production Readiness
0.75
Novelty Score
0.75
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.
Summary TLDR
HyFunc speeds up LLM-driven API/function calls by splitting the work into four steps: a large LLM produces a single semantic 'soft token' that encodes user intent; a tiny retriever picks candidate functions; a small LLM generates parameter values guided by that soft token; and fixed call syntax is injected at inference instead of being generated. Reported result: 0.828 s end-to-end latency per case while keeping 80.1% overall accuracy on BFCL.
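The flow described above can be sketched as a small orchestration function. This is an illustrative outline under stated assumptions, not the paper's code: `cascade_call` and its three callables (`encode_intent`, `pick_functions`, `fill_values`) are hypothetical names, and the stubs stand in for the large model, the retriever, and the small model.

```python
from typing import Callable, Dict, List, Sequence


def cascade_call(
    query: str,
    encode_intent: Callable[[str], Sequence[float]],
    pick_functions: Callable[[Sequence[float]], List[str]],
    fill_values: Callable[[str, Sequence[float], str], Dict[str, str]],
) -> List[str]:
    """Chain the stages: soft token, candidate functions, values, injected syntax."""
    soft_token = encode_intent(query)  # stand-in for one large-model forward pass
    calls = []
    for function_name in pick_functions(soft_token):  # stand-in for the retriever
        values = fill_values(query, soft_token, function_name)  # small model: values only
        args = ", ".join(f"{name}={value!r}" for name, value in values.items())
        calls.append(f"{function_name}({args})")  # call syntax is added here, not generated
    return calls


# Hypothetical stubs so the sketch runs end to end.
result = cascade_call(
    "What's the weather in Paris?",
    encode_intent=lambda q: [0.1, 0.9],
    pick_functions=lambda soft: ["get_weather"],
    fill_values=lambda q, soft, fn: {"city": "Paris"},
)
print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular model or serving stack.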
Problem Statement
Converting free-text user intent to executable function calls is slow because large models repeatedly parse long tool libraries, generate predictable boilerplate tokens, and produce full call sequences instead of just the needed values.
Main Contribution
Design HyFunc, a hybrid cascade that uses a single forward pass of a large LLM to produce a 'first soft token' and hands off generation to a smaller model.
Introduce an MLP-based dual-encoder retriever trained on soft-token/function embeddings to avoid re-processing the full function library every query.
Propose Selective SFT and Dynamic Templating: train only on parameter value tokens and inject fixed syntax during inference to skip boilerplate generation.
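Selective SFT, as described above, amounts to a loss mask: only tokens that belong to parameter values are supervised. The following is a minimal sketch under assumptions, not the authors' implementation; the helper name `build_selective_labels`, the toy token ids, and the `value_spans` input are hypothetical. The one borrowed convention is the label value -100, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_selective_labels(token_ids, value_spans):
    """Return labels that supervise only parameter-value tokens.

    token_ids   -- token ids of the full target call (hypothetical tokenization)
    value_spans -- list of (start, end) half-open index ranges that cover values
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in value_spans:
        if not 0 <= start <= end <= len(token_ids):
            raise ValueError(f"span {(start, end)} is out of range")
        labels[start:end] = token_ids[start:end]
    return labels


# Toy target standing in for: get_weather ( city = " Paris " )
tokens = [11, 12, 13, 14, 15, 42, 15, 16]
# In this toy example only index 5 (the value "Paris") is a parameter value.
labels = build_selective_labels(tokens, [(5, 6)])
print(labels)
```

Every masked position contributes nothing to the loss, so the model is never trained to reproduce the boilerplate that templating supplies at inference.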
Key Findings
HyFunc reduces end-to-end inference latency to 0.828 seconds per case.
HyFunc reaches 80.1% overall accuracy on the BFCL benchmark using a 0.6B-parameter small model.
The hybrid design cuts output tokens by 29.54% versus the base small model, and each component contributes a measurable gain in the ablations.
Results
End-to-end inference latency: 0.828 s per case
Accuracy: 80.1% overall on BFCL
Function Retriever Exact Match (Execute)
Output token reduction: 29.54% versus the base small model
Who Should Care
What To Try In 7 Days
Run a small pilot: add a lightweight retriever + soft-token projector to your existing function-calling pipeline and measure latency on a held-out dataset.
Enable Dynamic Templating for a single API: mask syntax tokens and inject the template while letting the model generate only parameter values.
A/B test HyFunc-tuned small model vs your current model on latency and end-to-end correctness (use BFCL or your logs).
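For the pilot and A/B steps above, a small harness that records latency and correctness per case is enough to start. This is a sketch under assumptions: `benchmark`, the two stand-in systems, and the toy cases are hypothetical, and exact string match is only a placeholder for whatever correctness check your benchmark or logs call for.

```python
import math
import statistics
import time


def benchmark(call_fn, cases):
    """Time call_fn on each (query, expected) pair and score exact-match accuracy."""
    latencies, correct = [], 0
    for query, expected in cases:
        start = time.perf_counter()
        predicted = call_fn(query)
        latencies.append(time.perf_counter() - start)
        correct += int(predicted == expected)
    ranked = sorted(latencies)
    p95_index = math.ceil(0.95 * len(ranked)) - 1  # nearest-rank percentile
    return {
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": ranked[p95_index],
        "accuracy": correct / len(cases),
    }


# Hypothetical held-out cases and stand-in systems; swap in real pipelines.
cases = [
    ("weather in Paris", 'get_weather(city="Paris")'),
    ("email Bob", 'send_email(to="Bob")'),
]
current_system = dict(cases).get
candidate_system = {"weather in Paris": 'get_weather(city="Paris")'}.get

report_a = benchmark(current_system, cases)
report_b = benchmark(candidate_system, cases)
print(report_a["accuracy"], report_b["accuracy"])
```

Run both systems on the same cases and hardware, and compare tail latency as well as the mean, since a cascade can shift the distribution rather than just the average.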
Agent Features
Memory
- KV cache for reuse across mode switches
Planning
- function selection (short-horizon)
Tool Use
- function calling
- template injection
Frameworks
- vLLM
- PyTorch
- HuggingFace Transformers
Is Agentic
true
Architectures
- hybrid-model cascade
- dual-encoder MLP retriever
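A dual-encoder retriever of this kind can be sketched in plain Python: one MLP tower encodes the soft token, another encodes each function once offline, and ranking is a similarity lookup. This is an illustrative sketch only. The identity weights are hand-set placeholders rather than trained parameters, cosine similarity is an assumed scoring choice, and the function names and vectors are hypothetical.

```python
import math


def mlp(x, layers):
    """Apply linear layers with ReLU between them (no activation after the last)."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def retrieve(soft_token, function_index, query_tower, k=2):
    """Rank precomputed function embeddings against the encoded soft token."""
    q = mlp(soft_token, query_tower)
    scored = [(cosine(q, emb), name) for name, emb in function_index.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Placeholder towers: two identity layers each (a trained retriever would learn these).
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
tower = [(IDENTITY, [0.0, 0.0]), (IDENTITY, [0.0, 0.0])]

# Function embeddings are computed once, offline, not on every query.
raw_functions = {
    "get_weather": [0.9, 0.2],
    "book_flight": [0.1, 1.0],
    "send_email": [0.5, 0.5],
}
function_index = {name: mlp(vec, tower) for name, vec in raw_functions.items()}

candidates = retrieve([1.0, 0.1], function_index, tower, k=2)
print(candidates)
```

Because the function side is indexed ahead of time, per-query cost is one small MLP pass plus similarity scoring, independent of how long the function descriptions are.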
Optimization Features
Token Efficiency
- generate only parameter values
- skip boilerplate via template injection
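The two points above can be illustrated with a template filler: the call syntax comes from code, and the model is asked only for each value. This is a minimal sketch under assumptions; `render_call` and `fake_small_model` are hypothetical stand-ins, and real decoding would interleave template tokens with model-generated tokens rather than call a Python function per parameter.

```python
import json


def render_call(function_name, param_names, generate_value):
    """Assemble a call by injecting fixed syntax and asking the model only for values."""
    parts = []
    for name in param_names:
        value = generate_value(function_name, name)  # the only model-generated text
        parts.append(f"{name}={json.dumps(value)}")  # name, "=", quotes are injected
    return f"{function_name}({', '.join(parts)})"


def fake_small_model(function_name, param_name):
    """Hypothetical stand-in for small-model decoding of one parameter value."""
    return {"city": "Paris", "unit": "celsius"}[param_name]


call = render_call("get_weather", ["city", "unit"], fake_small_model)
print(call)
```

In this example the model contributes only `Paris` and `celsius`; the function name, parentheses, parameter names, and quoting are supplied by the template.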
Infra Optimization
- vLLM integration
- small-model serving for majority of compute
Model Optimization
- model cascade (large model LML → small model LMS)
- projector to map embeddings
System Optimization
- split the work into a short forward pass on the large model (LML) and a longer generation on the small model (LMS) to reduce latency
Training Optimization
- SFT
- prefix continuous prompting (soft-token conditioning)
Inference Optimization
- Dynamic Templating (inject fixed syntax)
- MLP retriever for fast candidate pruning
- KV cache reuse across mode switches
Reproducibility
License
- Creative Commons Attribution 4.0 (paper); code repo license unspecified in text
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Sensitive to retriever quality: wrong retrievals cascade into wrong calls (Sec. 5).
- Currently single-turn only: not evaluated for multi-turn clarification or active questioning (Sec. 5).
- Higher peak GPU memory from running the hybrid models (reported maximum 18.1 GB), even though average usage is moderate (Table 7).
When Not To Use
- When function calling is unnecessary (HyFunc forces argument generation and may add decoding overhead).
- In highly multi-turn interactive agents without further extension to handle clarifying questions.
- On very low-memory GPUs that cannot accommodate the hybrid model's peak memory (≈18 GB).
Failure Modes
- Retriever returns wrong or missing candidates → LMS generates incorrect function or parameters.
- Projector misalignment between LML and LMS embedding spaces causes poor conditioning and wrong values.
- Dynamic templating forces generation of values even when no function should be called, producing meaningless parameters.
Core Entities
Models
- ToolACE-8B
- Qwen3-0.6B
- Qwen2.5-0.5B-Instruct
- Qwen3 series
- Qwen2.5 series
- Hammer series
- xLAM series
- Granite-20B
- GPT-4o
Metrics
- Latency (s)
- Accuracy
- Token count reduction (%)
Datasets
- BFCL (Berkeley Function Call Leaderboard)
- Salesforce/xlam-function-calling-60k (offline prep)
Benchmarks
- BFCL

