An open 20B model trained to spot, sequence, and call APIs reliably — ranks 4th on Berkeley's function-calling leaderboard.

June 27, 2024 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides extensive zero-shot comparisons on multiple benchmarks and publishes the model under Apache 2.0; results are strong for function selection but parameter extraction still needs validation in production.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/6

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache 2.0

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, Pavan Kapanipathi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

GRANITE-20B-FUNCTIONCALLING is an open, pilot-ready model for reliable API selection and response synthesis. It lowers the risk of calling the wrong API and offers a license-friendly alternative to closed models, although its parameter extraction still needs validation before production use.

Who Should Care

Summary TLDR

The authors release GRANITE-20B-FUNCTIONCALLING, a 20B-parameter, instruction-tuned code model trained with a multi-task mixture focused on seven granular function-calling tasks (name detection, parameter extraction, sequencing, chaining, parallel calls, next-best function, and response generation). Trained with QLoRA on ~142K examples drawn from API-BLEND and Glaive-V2, the model is the best open-license entry on the Berkeley Function Calling Leaderboard (BFCL) and shows strong out-of-domain generalization on several academic benchmarks. It is strong at choosing which functions to call but weaker at filling in all of their parameters. The model and weights are released under Apache 2.0.

Problem Statement

LLMs used as agents must reliably identify, sequence, and invoke external APIs. Existing function-calling models often fail on generalization, on handling fine-grained sub-tasks (e.g., parameter extraction, next-best-function), or are proprietary. The paper aims to build an open model that learns these granular tasks jointly and generalizes to out-of-domain benchmarks.

Main Contribution

Released GRANITE-20B-FUNCTIONCALLING, an Apache-2.0 open 20B instruction-tuned model focused on function calling.

Designed a multi-task training mixture covering seven granular function-calling tasks using API-BLEND and Glaive-V2 (~142K examples).

Key Findings

GRANITE-20B-FUNCTIONCALLING ranks 4th on BFCL overall accuracy and is the top open-license model.

Numbers: Overall Acc. 84.71 on BFCL (Table 4)

Practical Use: If you need an open model for production tool use, this is the strongest open alternative to proprietary models on BFCL.

Evidence Ref: Table 4

Model is especially strong at detecting which functions to call from text.

Numbers: Function name detection avg F1 = 0.74; LCS = 0.73; Exact match = 0.43 (Table 5)

Practical Use: Use this model when you primarily need accurate function selection and sequencing; it reduces wrong API choices and hallucinated function names.

Evidence Ref: Table 5
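
To make the three function-name scores concrete, here is a minimal sketch of how F1, LCS, and exact match could be computed for one predicted call sequence against a gold sequence. The function names in the example are hypothetical, and normalizing LCS by the gold length is an assumption; the paper's exact metric definitions may differ.

```python
from collections import Counter


def name_f1(predicted, gold):
    """Multiset F1 over function names, ignoring call order."""
    if not predicted or not gold:
        return 0.0
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def lcs_ratio(predicted, gold):
    """Longest common subsequence length divided by the gold length (assumed normalization)."""
    if not gold:
        return 0.0
    rows, cols = len(predicted), len(gold)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if predicted[i] == gold[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[rows][cols] / cols


def exact_match(predicted, gold):
    """1.0 only when the predicted sequence equals the gold sequence."""
    return float(list(predicted) == list(gold))


# Hypothetical example: the model skipped the middle call.
gold = ["search_flights", "book_flight", "send_receipt"]
pred = ["search_flights", "send_receipt"]
scores = (name_f1(pred, gold), lcs_ratio(pred, gold), exact_match(pred, gold))
```

Averaging these per-example scores over a dev set gives numbers comparable in spirit, though not necessarily in definition, to the ones reported above.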

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 84.71 | - | - | BFCL (zero-shot) | Ranked 4th overall and top among open-license models | Table 4
BFCL AST / Execution / Relevance | AST 84.11, Exec 86.5, Relevance 87.08 | - | - | BFCL | Per-category scores reported on BFCL | Table 4

What To Try In 7 Days

Run the Hugging Face release of GRANITE-20B-FUNCTIONCALLING on a dev dataset and compare function-name F1 versus your current model.

Add a light validator for parameter types and required fields before executing predicted API calls.

Use the model to generate candidate function sequences and log-only execute them for a week to spot hallucinations.
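
The light validator suggested above could look roughly like the following. This is a sketch under stated assumptions: the spec layout, the `SPECS` registry, the `get_weather` function, and the type names are all hypothetical and not taken from the paper or the model card.

```python
# Hypothetical mapping from spec type names to Python types.
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Hypothetical tool registry; the names and fields are illustrative only.
SPECS = {
    "get_weather": {
        "parameters": {"city": "string", "days": "integer"},
        "required": ["city"],
    }
}


def validate_call(call, specs):
    """Return a list of problems; an empty list means the call passed all checks."""
    name = call.get("name")
    if name not in specs:
        return [f"unknown function: {name!r}"]
    spec = specs[name]
    arguments = call.get("arguments", {})
    problems = [
        f"missing required parameter: {param}"
        for param in spec.get("required", [])
        if param not in arguments
    ]
    for param, value in arguments.items():
        type_name = spec["parameters"].get(param)
        if type_name is None:
            problems.append(f"unexpected parameter: {param}")
            continue
        expected = PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for {param}: expected {type_name}")
    return problems


# A predicted call that omits a required field and passes a string for an integer.
predicted = {"name": "get_weather", "arguments": {"days": "3"}}
problems = validate_call(predicted, SPECS)
```

During a log-only trial, recording the returned problem list next to each predicted call gives a quick view of how often parameters would have caused a failed request.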

Agent Features

Planning
function chaining, nested function sequencing, next-best-function prediction
Tool Use
function name detection, parameter-value extraction, parallel/sequence function calls, JSON-formatted function invocation
Frameworks
LoRA, JSON function schema for calls
Is Agentic

Yes

Architectures
instruction-tuned code model (Granite family), decoder-only transformer (implicit)

Optimization Features

Token Efficiency
8192 token context support in base model; trimmed function specs for prompt fit
Infra Optimization
single-node multi-GPU training (8x A100-80GB)
Model Optimization
LoRA
System Optimization
trained on 8x A100-80GB with mixed precision
Training Optimization
multi-task mixture weighting across tasks and datasets, instruction tuning with task-specific prompts
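
As an illustration of mixture weighting, the sketch below draws training tasks in proportion to per-task weights. The task names echo the seven tasks listed in the summary, but the weights are hypothetical placeholders, and the paper's actual sampling scheme may differ.

```python
import random
from collections import Counter

# Hypothetical weights; the paper's real mixture proportions are not reproduced here.
MIXTURE_WEIGHTS = {
    "function_name_detection": 2.0,
    "parameter_extraction": 2.0,
    "sequencing": 1.0,
    "chaining": 1.0,
    "parallel_calls": 1.0,
    "next_best_function": 1.0,
    "response_generation": 1.0,
}


def sample_tasks(weights, n, seed=0):
    """Draw n task names with probability proportional to their weights."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[t] for t in names], k=n)


counts = Counter(sample_tasks(MIXTURE_WEIGHTS, 1000))
```

Raising a task's weight increases how often its examples appear in a training batch, which is one simple way to emphasize weaker sub-tasks such as parameter extraction.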

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Apache 2.0

Data URLs

https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2

Reference to API-BLEND (Basu et al., 2024), described in the paper

Risks & Boundaries

Limitations

Context-length limits forced removal of argument types and required/optional flags from function specs during training and evaluation.

Evaluation on Java/JavaScript/REST categories shows brittleness tied to syntax rules and external API availability.
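
To show what removing argument types and required/optional flags might look like in practice, here is a minimal sketch that trims a JSON-Schema-style function spec down to names and descriptions. The spec layout and the `get_weather` example are assumptions for illustration; the summary does not describe the authors' actual trimming procedure.

```python
import json


def trim_spec(spec):
    """Drop argument types and required flags, keeping only names and descriptions."""
    properties = spec.get("parameters", {}).get("properties", {})
    return {
        "name": spec["name"],
        "description": spec.get("description", ""),
        "parameters": {
            arg: {"description": details.get("description", "")}
            for arg, details in properties.items()
        },
    }


# Hypothetical full spec in a JSON-Schema-like layout.
full_spec = {
    "name": "get_weather",
    "description": "Look up a weather forecast.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name."},
            "days": {"type": "integer", "description": "Forecast length in days."},
        },
        "required": ["city"],
    },
}

trimmed = trim_spec(full_spec)
saved_chars = len(json.dumps(full_spec)) - len(json.dumps(trimmed))
```

The trimmed spec is shorter, but the model no longer sees which arguments are mandatory or what type they take, which is consistent with the parameter-filling weakness noted above.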

When Not To Use

When you need fully typed function specs in the prompt and cannot truncate signatures.

When live API execution depends on strict parameter typing without secondary validation.

Failure Modes

Missing or incorrect parameter values leading to failed API calls.

Hallucinated function calls when encountering very different libraries than trained ones (rare but possible).
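
One simple way to monitor the hallucination failure mode is to measure how many predicted calls name a function that was never offered to the model. The sketch below assumes calls are dictionaries with a `name` key, and the tool names are hypothetical; the paper's own hallucination-rate definition may differ.

```python
def hallucination_rate(predicted_calls, available_names):
    """Fraction of predicted calls whose function name is not in the offered tool list."""
    if not predicted_calls:
        return 0.0
    allowed = set(available_names)
    unknown = [call for call in predicted_calls if call.get("name") not in allowed]
    return len(unknown) / len(predicted_calls)


# Hypothetical tool list and model output.
tools = ["search_flights", "book_flight", "send_receipt"]
calls = [
    {"name": "search_flights"},
    {"name": "book_flight"},
    {"name": "cancel_hotel"},  # not in the tool list, so counted as hallucinated
    {"name": "send_receipt"},
]
rate = hallucination_rate(calls, tools)
```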

Core Entities

Models

GRANITE-20B-FUNCTIONCALLING, GRANITE-20B-CODE-INSTRUCT, Gorilla-openfunctions-v2, Meta-Llama-3-70B-Instruct, C4AI-Command-R-v01, Claude-3.5-Sonnet, GPT-4-0125-Preview, Gemini-1.5-Pro

Metrics

AST summary, Accuracy, Relevance, F1 (function name and parameters), LCS, Exact match, BERTScore, ROUGE-L, BLEU, Hallucination rate

Datasets

API-BLEND, SeqSGD, SeqSNIPS, SeqTopV2, SeqATIS, SeqMultiWOZ, Glaive-V2

Benchmarks

Berkeley Function Calling Leaderboard (BFCL), ToolLLM, RestGPT, API-Bank, ToolBench, ToolAlpaca, NexusRaven

Context Entities

Models

Gorilla (Patil et al.), ToolLlama, ToolAlpaca, Cohere Command-R, NexusRaven

Metrics

Function matching F1 and response generation metrics used for comparison

Datasets

RapidAPI synthetic sets (mentioned), API-Bank (evaluation subset)

Benchmarks

BFCL (used as main leaderboard)