Overview
The paper provides extensive zero-shot comparisons on multiple benchmarks and publishes the model under Apache 2.0; results are strong for function selection but parameter extraction still needs validation in production.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/6
Reproducibility
Status: Code + data available
Open source: Yes
License: Apache 2.0
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
GRANITE-20B-FUNCTIONCALLING is an open model for API selection and response synthesis. It lowers the risk of calling the wrong API and offers a license-friendly alternative to closed models, though predicted parameters still need validation before production use.
Who Should Care
Summary TLDR
The authors release GRANITE-20B-FUNCTIONCALLING, a 20B-parameter, instruction-tuned code model trained with a multi-task mixture focused on seven granular function-calling tasks (name detection, parameter extraction, sequencing, chaining, parallel calls, next-best function, and response generation). Trained with QLoRA on ~142K examples drawn from API-BLEND and Glaive-V2, the model is the best open-license entry on the Berkeley Function Calling Leaderboard (BFCL) and shows strong out-of-domain generalization on several academic benchmarks. It is good at choosing which functions to call but weaker at fully filling in their parameters. The model and weights are released under Apache 2.0.
Problem Statement
LLMs used as agents must reliably identify, sequence, and invoke external APIs. Existing function-calling models often fail on generalization, on handling fine-grained sub-tasks (e.g., parameter extraction, next-best-function), or are proprietary. The paper aims to build an open model that learns these granular tasks jointly and generalizes to out-of-domain benchmarks.
Main Contribution
Released GRANITE-20B-FUNCTIONCALLING, an Apache-2.0 open 20B instruction-tuned model focused on function calling.
Designed a multi-task training mixture covering seven granular function-calling tasks using API-BLEND and Glaive-V2 (~142K examples).
Key Findings
GRANITE-20B-FUNCTIONCALLING ranks 4th on BFCL overall accuracy and is the top open-license model.
The model is especially strong at detecting which functions to call from text.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 84.71 | — | — | BFCL (zero-shot) | Ranked 4th overall and top among open-license models | Table 4 |
| BFCL AST / Execution / Relevance | AST 84.11 / Exec 86.5 / Relevance 87.08 | — | — | BFCL | Per-category scores reported on BFCL | Table 4 |
What To Try In 7 Days
Run the Hugging Face release of GRANITE-20B-FUNCTIONCALLING on a dev dataset and compare function-name F1 versus your current model.
Add a light validator for parameter types and required fields before executing predicted API calls.
Use the model to generate candidate function sequences and run them in log-only mode (no real execution) for a week to spot hallucinations.
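For the first step, function-name F1 can be computed along the following lines. This is a minimal sketch, assuming each model's output has already been parsed into a list of function names per example; the helper name `function_name_f1` and the micro-averaged, order-insensitive scoring are illustrative assumptions, and the paper's exact scoring may differ.

```python
from collections import Counter


def function_name_f1(predicted, gold):
    """Micro-averaged F1 over function names across a dev set.

    predicted, gold: lists of lists of function-name strings, one inner
    list per example. Names are compared as multisets, so call order is
    ignored (an assumption of this sketch, not the paper's definition).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of examples")

    tp = fp = fn = 0
    for pred_names, gold_names in zip(predicted, gold):
        pred_counts, gold_counts = Counter(pred_names), Counter(gold_names)
        # Multiset intersection: names predicted and also expected.
        overlap = sum((pred_counts & gold_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical dev-set example: the second prediction swaps one function.
gold = [["get_weather"], ["search_flights", "book_hotel"]]
pred = [["get_weather"], ["search_flights", "get_weather"]]
score = function_name_f1(pred, gold)
```

Running the same function over the outputs of your current model and of GRANITE-20B-FUNCTIONCALLING on an identical dev set gives a like-for-like comparison of function selection, separate from parameter quality.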
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Risks & Boundaries
Limitations
Context-length limits forced removal of argument types and required/optional flags from function specs during training and evaluation.
Evaluation on Java/JavaScript/REST categories shows brittleness tied to syntax rules and external API availability.
When Not To Use
When you need fully typed function specs in the prompt and cannot truncate signatures.
When live API execution depends on strict parameter typing without secondary validation.
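A secondary validation layer of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions: the spec layout (`parameters` mapping names to type labels, plus a `required` list), the `SPECS` registry, and the `validate_call` helper are all hypothetical and are not part of the model release.

```python
# Map type labels used in the hypothetical specs to Python types.
PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}

# Hypothetical function registry; replace with your real API specs.
SPECS = {
    "get_weather": {
        "parameters": {"city": "string", "days": "integer"},
        "required": ["city"],
    }
}


def validate_call(call, specs):
    """Return a list of problems with a predicted call; empty means it passed."""
    spec = specs.get(call.get("name"))
    if spec is None:
        return [f"unknown function: {call.get('name')!r}"]

    args = call.get("arguments", {})
    problems = []
    for name in spec.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")

    for name, value in args.items():
        expected = spec.get("parameters", {}).get(name)
        if expected is None:
            problems.append(f"unexpected parameter: {name}")
            continue
        py_type = PYTHON_TYPES.get(expected)
        if py_type is None:
            continue  # Unknown type label: not checked in this sketch.
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        bool_mismatch = isinstance(value, bool) and expected != "boolean"
        if bool_mismatch or not isinstance(value, py_type):
            problems.append(
                f"wrong type for {name}: expected {expected}, got {type(value).__name__}"
            )
    return problems


good = validate_call({"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}, SPECS)
bad = validate_call({"name": "get_weather", "arguments": {"days": "3"}}, SPECS)
```

Only calls that come back with an empty problem list would be passed on to the live API; everything else can be rejected or sent back to the model for repair.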
Failure Modes
Missing or incorrect parameter values leading to failed API calls.
Hallucinated function calls when the model encounters libraries very different from those seen in training (rare but possible).
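One low-risk way to observe these failure modes is to record predicted calls without executing them. The sketch below assumes predicted calls arrive as dictionaries with `name` and `arguments` keys; the `dry_run` helper, the logger name, and the example function names are hypothetical.

```python
import json
import logging

logger = logging.getLogger("dry_run")


def dry_run(calls, known_functions):
    """Log each predicted call instead of executing it.

    Returns the calls whose function name is not in known_functions,
    which are candidates for hallucinated functions.
    """
    flagged = []
    for step, call in enumerate(calls):
        name = call.get("name")
        status = "ok" if name in known_functions else "unknown_function"
        record = {
            "step": step,
            "name": name,
            "arguments": call.get("arguments", {}),
            "status": status,
        }
        # One JSON line per call keeps the log easy to grep and aggregate later.
        logger.info(json.dumps(record))
        if status != "ok":
            flagged.append(call)
    return flagged


# Hypothetical predicted sequence: the second call names a function that does not exist.
predicted_sequence = [
    {"name": "search_flights", "arguments": {"origin": "BOS", "destination": "SFO"}},
    {"name": "teleport", "arguments": {"destination": "SFO"}},
]
flagged = dry_run(predicted_sequence, {"search_flights", "book_hotel"})
```

Aggregating the flagged calls over a trial period gives a rough hallucination rate for your own API surface before any call is allowed to execute.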

