Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
GRANITE-20B-FUNCTIONCALLING is an open, production-ready model for reliable API selection and response synthesis. It reduces the risk of calling the wrong API and offers a license-friendly alternative to closed models.
Summary TLDR
The authors release GRANITE-20B-FUNCTIONCALLING, a 20B-parameter, instruction-tuned code model trained with a multi-task mixture focused on seven granular function-calling tasks (name detection, parameter extraction, sequencing, chaining, parallel calls, next-best function, and response generation). Trained with QLoRA on ~142K examples drawn from API-BLEND and Glaive-V2, the model is the best open-license entry on the Berkeley Function Calling Leaderboard (BFCL) and shows strong out-of-domain generalization on several academic benchmarks. It is strong at choosing which functions to call but weaker at filling in their parameters completely. The model and weights are released under Apache 2.0.
Problem Statement
LLMs used as agents must reliably identify, sequence, and invoke external APIs. Existing function-calling models often fail on generalization, on handling fine-grained sub-tasks (e.g., parameter extraction, next-best-function), or are proprietary. The paper aims to build an open model that learns these granular tasks jointly and generalizes to out-of-domain benchmarks.
Main Contribution
Released GRANITE-20B-FUNCTIONCALLING, an Apache-2.0 open 20B instruction-tuned model focused on function calling.
Designed a multi-task training mixture covering seven granular function-calling tasks using API-BLEND and Glaive-V2 (~142K examples).
Used QLoRA fine-tuning (rank-8) on GRANITE-20B-CODE-INSTRUCT and trained for 3 epochs on 8x A100-80GB.
Ran a comprehensive zero-shot evaluation on BFCL and six out-of-domain academic benchmarks; the model is the best open model on BFCL and generalizes well out of domain.
Key Findings
GRANITE-20B-FUNCTIONCALLING ranks 4th in overall accuracy on BFCL and is the top open-license model.
The model is especially strong at detecting which functions to call from text.
Identifying arguments (parameters and values) is weaker than function-name detection.
Natural-language response quality is near the best evaluated model.
Hallucination rate for predicting function names is low.
Results
Accuracy
BFCL AST / Execution / Relevance
Function name detection (avg)
Full function calling (name+args) average F1
Response generation (API-Bank)
Hallucination rate (function names)
Who Should Care
What To Try In 7 Days
Run the Hugging Face release of GRANITE-20B-FUNCTIONCALLING on a dev dataset and compare function-name F1 versus your current model.
Add a light validator for parameter types and required fields before executing predicted API calls.
Use the model to generate candidate function sequences and run them in log-only mode (recorded, not executed) for a week to spot hallucinations.
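The light validator suggested above could look something like the sketch below. It is only an illustration: the spec layout (parameter name mapped to an expected Python type and a required flag), the `validate_call` helper, and the `get_weather` function are all hypothetical and are not taken from the paper or the model release.

```python
from typing import Any

# Hypothetical spec layout: parameter name -> (expected Python type, required?)
Spec = dict[str, tuple[type, bool]]


def validate_call(call: dict[str, Any], specs: dict[str, Spec]) -> list[str]:
    """Return a list of problems; an empty list means the call passed all checks."""
    name = call.get("name")
    if name not in specs:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be an object"]

    spec = specs[name]
    problems = []
    # Check every declared parameter for presence and type.
    for param, (expected, required) in spec.items():
        if param not in args:
            if required:
                problems.append(f"missing required parameter: {param}")
        elif not isinstance(args[param], expected):
            got = type(args[param]).__name__
            problems.append(f"{param}: expected {expected.__name__}, got {got}")
    # Flag parameters the spec does not declare.
    for param in args:
        if param not in spec:
            problems.append(f"unexpected parameter: {param}")
    return problems


# Illustrative spec for a made-up function.
SPECS = {"get_weather": {"city": (str, True), "days": (int, False)}}
```

Running this check before execution, and logging rejected calls rather than dropping them silently, gives a simple record of how often predicted arguments would have failed.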
Agent Features
Planning
- function chaining
- nested function sequencing
- next-best-function prediction
Tool Use
- function name detection
- parameter-value extraction
- parallel/sequence function calls
- JSON-formatted function invocation
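As a rough illustration of consuming JSON-formatted invocations, the sketch below parses raw model output into a list of calls. It assumes each call is a JSON object with a `name` string and an optional `arguments` object, and that multiple calls arrive as a JSON array. The released model's exact output wrapper may differ, so `parse_calls` is a hypothetical helper rather than part of any official API.

```python
import json
from typing import Any


def parse_calls(raw: str) -> list[dict[str, Any]]:
    """Parse model output into a list of {"name", "arguments"} dicts.

    Accepts a single JSON object or a JSON array of objects (assumed format).
    Raises ValueError on malformed output so callers can refuse to execute it.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc

    calls = data if isinstance(data, list) else [data]
    for call in calls:
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            raise ValueError(f"malformed call: {call!r}")
        # Treat a missing arguments field as a call with no arguments.
        call.setdefault("arguments", {})
    return calls
```

Failing loudly on malformed output keeps parsing errors separate from the argument-quality problems discussed elsewhere in this summary.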
Frameworks
- QLoRA
- JSON function schema for calls
Is Agentic
true
Architectures
- instruction-tuned code model (Granite family)
- decoder-only transformer (implicit)
Optimization Features
Token Efficiency
- 8192 token context support in base model; trimmed function specs for prompt fit
Infra Optimization
- single-node multi-GPU training (8x A100-80GB)
Model Optimization
- QLoRA (rank-8 adapters)
System Optimization
- trained on 8x A100-80GB with mixed precision
Training Optimization
- multi-task mixture weighting across tasks and datasets
- instruction tuning with task-specific prompts
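To make the idea of mixture weighting concrete, here is a minimal sketch of drawing training examples from several task datasets in proportion to per-task weights. The task names, weights, and the `sample_mixture` helper are illustrative assumptions; the paper's actual weights and sampling procedure are not given in this summary.

```python
import random


def sample_mixture(
    datasets: dict[str, list],
    weights: dict[str, float],
    n: int,
    seed: int = 0,
) -> list[tuple[str, object]]:
    """Draw n (task, example) pairs, choosing the task by weight each time."""
    rng = random.Random(seed)  # seeded for reproducible batches
    tasks = list(datasets)
    # Pick a task for every slot according to the (unnormalized) weights.
    chosen = rng.choices(tasks, weights=[weights[t] for t in tasks], k=n)
    # Then pick one example uniformly from the chosen task's dataset.
    return [(task, rng.choice(datasets[task])) for task in chosen]
```

Sampling with replacement like this lets a small but important task be seen more often than its raw size would allow, which is one common motivation for weighting a mixture.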
Reproducibility
License
- Apache 2.0
Code Urls
Data Urls
- https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2
- reference to API-BLEND (Basu et al., 2024) described in paper
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Context-length limits forced removal of argument types and required/optional flags from function specs during training and evaluation.
- Evaluation on Java/JavaScript/REST categories shows brittleness tied to syntax rules and external API availability.
- Parameter-value extraction lags behind function-name detection and may require post-processing for safe execution.
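The first limitation above describes stripping argument types and required/optional flags so that function specs fit the context window. A minimal sketch of that kind of trimming follows, assuming a JSON-schema-style spec with `parameters.properties` and `parameters.required`. This layout and the `trim_spec` helper are assumptions for illustration, not the authors' preprocessing code.

```python
import copy


def trim_spec(spec: dict) -> dict:
    """Return a copy of the spec without per-argument types or the required list."""
    trimmed = copy.deepcopy(spec)  # leave the caller's full spec untouched
    params = trimmed.get("parameters", {})
    params.pop("required", None)
    for prop in params.get("properties", {}).values():
        prop.pop("type", None)
    return trimmed


# Illustrative spec for a made-up function.
FULL_SPEC = {
    "name": "get_weather",
    "description": "Look up a forecast.",
    "parameters": {
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
```

Keeping the untrimmed spec around is useful because the information removed from the prompt is exactly what a post-hoc validator needs.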
When Not To Use
- When you need fully typed function specs in the prompt and cannot truncate signatures.
- When live API execution depends on strict parameter typing without secondary validation.
- If you need the absolute best argument-filling accuracy (closed models may be slightly better).
Failure Modes
- Missing or incorrect parameter values leading to failed API calls.
- Hallucinated function calls when the model encounters libraries very different from those seen in training (rare but possible).
- Brittleness for language-specific code generation (Java/JavaScript) due to syntax nuance.
Core Entities
Models
- GRANITE-20B-FUNCTIONCALLING
- GRANITE-20B-CODE-INSTRUCT
- Gorilla-openfunctions-v2
- Meta-Llama-3-70B-Instruct
- C4AI-Command-R-v01
- Claude-3.5-Sonnet
- GPT-4-0125-Preview
- Gemini-1.5-Pro
Metrics
- AST summary
- Accuracy
- Relevance
- F1 (func name, params)
- LCS
- Exact match
- BERTScore
- ROUGE-L
- BLEU
- Hallucination rate
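For readers who want to reproduce the function-name comparison on their own data, the sketch below shows one plausible way to compute a set-based function-name F1 and a hallucination rate. The paper's exact definitions (for example, per-sample averaging or handling of repeated calls) may differ, so treat both helpers as assumptions.

```python
def function_name_f1(predicted: list[str], gold: list[str]) -> float:
    """Set-based F1 between predicted and gold function names for one sample."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing expected, nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(predicted: list[str], available: set[str]) -> float:
    """Fraction of predicted function names that are not in the provided function list."""
    if not predicted:
        return 0.0
    return sum(name not in available for name in predicted) / len(predicted)
```

Averaging `function_name_f1` over a dev set gives a single number to compare against a current model, while `hallucination_rate` isolates the specific failure of naming functions that were never offered.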
Datasets
- API-BLEND
- SeqSGD
- SeqSNIPS
- SeqTopV2
- SeqATIS
- SeqMultiWOZ
- Glaive-V2
Benchmarks
- Berkeley Function Calling Leaderboard (BFCL)
- ToolLLM
- RestGPT
- API-Bank
- ToolBench
- ToolAlpaca
- NexusRaven
Context Entities
Models
- Gorilla (Patil et al.)
- ToolLlama
- ToolAlpaca
- Cohere Command-R
- NexusRaven
Metrics
- Function matching F1 and response generation metrics used for comparison
Datasets
- RapidAPI synthetic sets (mentioned)
- API-Bank (evaluation subset)
Benchmarks
- BFCL (used as main leaderboard)

