An open 20B model trained to spot, sequence, and call APIs reliably — ranks 4th on Berkeley's function-calling leaderboard.

Overview

Decision Snapshot: Ready For Pilot

The paper provides extensive zero-shot comparisons on multiple benchmarks and publishes the model under Apache 2.0; results are strong for function selection but parameter extraction still needs validation in production.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/6

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache 2.0

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, Pavan Kapanipathi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

GRANITE-20B-FUNCTIONCALLING is an open model for reliable API selection and response synthesis that is ready to pilot; it lowers the risk of calling the wrong API and offers a license-friendly alternative to closed models.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The authors release GRANITE-20B-FUNCTIONCALLING, a 20B-parameter, instruction-tuned code model trained with a multi-task mixture focused on seven granular function-calling tasks (name detection, parameter extraction, sequencing, chaining, parallel calls, next-best function, and response generation). Trained with QLoRA on ~142K examples drawn from API-BLEND and Glaive-V2, the model is the best open-license entry on the Berkeley Function Calling Leaderboard (BFCL) and shows strong out-of-domain generalization on several academic benchmarks. It is good at choosing which functions to call but weaker at fully filling in their parameters. The model and weights are released under Apache 2.0.

Problem Statement

LLMs used as agents must reliably identify, sequence, and invoke external APIs. Existing function-calling models often fail on generalization, on handling fine-grained sub-tasks (e.g., parameter extraction, next-best-function), or are proprietary. The paper aims to build an open model that learns these granular tasks jointly and generalizes to out-of-domain benchmarks.

Main Contribution

Released GRANITE-20B-FUNCTIONCALLING, an Apache-2.0 open 20B instruction-tuned model focused on function calling.

Designed a multi-task training mixture covering seven granular function-calling tasks using API-BLEND and Glaive-V2 (~142K examples).

Key Findings

GRANITE-20B-FUNCTIONCALLING ranks 4th on BFCL overall accuracy and is the top open-license model.

Numbers: Overall Acc. 84.71 on BFCL (Table 4)

Practical Use: If you need an open model for production tool use, this is the strongest open alternative to proprietary models on BFCL.

Evidence Ref: Table 4

Model is especially strong at detecting which functions to call from text.

Numbers: Function name detection avg F1 = 0.74; LCS = 0.73; Exact match = 0.43 (Table 5)

Practical Use: Use this model when you primarily need accurate function selection and sequencing; it reduces wrong API choices and hallucinated function names.

Evidence Ref: Table 5
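
The three scores reported for function name detection can be illustrated with a small sketch. The definitions used here are assumptions for illustration only (set-based F1 over function names, longest-common-subsequence length normalized by the gold sequence length, and strict sequence equality); the paper's own scoring scripts may define them differently.

```python
from typing import List


def name_f1(pred: List[str], gold: List[str]) -> float:
    """Set-based F1 over function names (call order is ignored)."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcs_score(pred: List[str], gold: List[str]) -> float:
    """LCS length divided by gold length (an assumed normalization)."""
    if not gold:
        return 1.0 if not pred else 0.0
    return lcs_length(pred, gold) / len(gold)


def exact_match(pred: List[str], gold: List[str]) -> bool:
    """True only when names and order match exactly."""
    return pred == gold


# Hypothetical example: the model skipped the middle call.
pred = ["search_flights", "book_flight"]
gold = ["search_flights", "select_seat", "book_flight"]
print(name_f1(pred, gold), lcs_score(pred, gold), exact_match(pred, gold))
```

Under these definitions a prediction can score well on F1 and LCS while still failing exact match, which is one plausible reading of the gap between 0.74 and 0.43 above.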

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	84.71	—	—	BFCL (zero-shot)	Ranked 4th overall and top among open-license models	Table 4
BFCL AST / Execution / Relevance	AST 84.11; Exec 86.5; Relevance 87.08	—	—	BFCL	Per-category scores reported on BFCL	Table 4

What To Try In 7 Days

Run the Hugging Face release of GRANITE-20B-FUNCTIONCALLING on a dev dataset and compare function-name F1 versus your current model.

Add a light validator for parameter types and required fields before executing predicted API calls.

Use the model to generate candidate function sequences and log-only execute them for a week to spot hallucinations.
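
The validator and log-only steps above could be combined along the following lines. This is a minimal sketch under stated assumptions: the model output is taken to be a single JSON object with `name` and `arguments` keys, and `SPECS`, `validate_call`, and `dry_run` are hypothetical names introduced here, not part of the model release.

```python
import json
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger("tool_call_dry_run")

# Hypothetical registry: function name -> {parameter: (expected type, required?)}
SPECS: Dict[str, Dict[str, Tuple[type, bool]]] = {
    "get_weather": {"city": (str, True), "days": (int, False)},
}


def validate_call(call: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call passed."""
    name = call.get("name")
    if not isinstance(name, str) or name not in SPECS:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = SPECS[name]
    problems = [
        f"missing required parameter: {p}"
        for p, (_, required) in params.items()
        if required and p not in args
    ]
    for p, value in args.items():
        if p not in params:
            problems.append(f"unexpected parameter: {p}")
            continue
        expected = params[p][0]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"{p}: expected {expected.__name__}, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{p}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


def dry_run(raw_output: str) -> List[str]:
    """Parse and validate a predicted call, then log it instead of executing."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        problems = [f"invalid JSON: {exc.msg}"]
    else:
        if isinstance(call, dict):
            problems = validate_call(call)
        else:
            problems = ["expected a JSON object"]
    logger.info("call=%s problems=%s", raw_output, problems)
    return problems
```

Collecting the returned problem lists over the log-only week gives a rough picture of how often predicted calls would have failed, without touching a live API.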

Agent Features

Planning

function chaining, nested function sequencing, next-best-function prediction

Tool Use

function name detection, parameter-value extraction, parallel/sequence function calls, JSON-formatted function invocation

Frameworks

LoRA, JSON function schema for calls

Is Agentic

Yes

Architectures

instruction-tuned code model (Granite family), decoder-only transformer (implicit)

Optimization Features

Token Efficiency

8192-token context support in base model; function specs trimmed to fit the prompt

Infra Optimization

single-node multi-GPU training (8x A100-80GB)

Model Optimization

LoRA

System Optimization

trained on 8x A100-80GB with mixed precision

Training Optimization

multi-task mixture weighting across tasks and datasets, instruction tuning with task-specific prompts

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Apache 2.0

Code URLs

https://huggingface.co/ibm-granite/

Data URLs

https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2

API-BLEND (Basu et al., 2024), described in the paper

Risks & Boundaries

Limitations

Context-length limits forced removal of argument types and required/optional flags from function specs during training and evaluation.

Evaluation on Java/JavaScript/REST categories shows brittleness tied to syntax rules and external API availability.
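
To make the first limitation concrete, the sketch below shows one way a function spec might be compacted to save prompt tokens. It assumes an OpenAI-style JSON schema layout (`parameters.properties`, `required`); `trim_spec` is a hypothetical helper, and the authors' actual trimming procedure is not described here and may differ.

```python
import json
from typing import Any, Dict


def trim_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Keep the name, description, and per-parameter descriptions only.

    Argument types and the required/optional list are dropped, which is the
    kind of information the limitation above says was removed.
    """
    properties = spec.get("parameters", {}).get("properties", {})
    return {
        "name": spec["name"],
        "description": spec.get("description", ""),
        "parameters": {
            param: info.get("description", "")
            for param, info in properties.items()
        },
    }


# Hypothetical spec used only for illustration.
full = {
    "name": "get_weather",
    "description": "Get a forecast.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "days": {"type": "integer", "description": "Forecast length"},
        },
        "required": ["city"],
    },
}

trimmed = trim_spec(full)
print(len(json.dumps(full)), "->", len(json.dumps(trimmed)), "characters")
```

The compact form is shorter, but the model no longer sees which arguments are mandatory or what types they take, so that information has to be enforced outside the prompt.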

When Not To Use

When you need full typed function specs in prompt and cannot truncate signatures.

When live API execution depends on strict parameter typing without secondary validation.

Failure Modes

Missing or incorrect parameter values leading to failed API calls.

Hallucinated function calls when encountering very different libraries than trained ones (rare but possible).
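
A simple guard against the second failure mode is to compare predicted function names with the tools actually offered in the prompt. The sketch below is illustrative only: `split_calls` and `hallucination_rate` are hypothetical helpers, and treating "name not in the offered tool set" as a hallucination is an assumption that may not match how the paper measures it.

```python
from typing import List, Set, Tuple


def split_calls(predicted: List[str], available: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate predicted names into known tools and hallucinated ones."""
    known = [name for name in predicted if name in available]
    hallucinated = [name for name in predicted if name not in available]
    return known, hallucinated


def hallucination_rate(batches: List[List[str]], available: Set[str]) -> float:
    """Fraction of all predicted calls whose name is not an offered tool."""
    total = sum(len(batch) for batch in batches)
    if total == 0:
        return 0.0
    bad = sum(1 for batch in batches for name in batch if name not in available)
    return bad / total


# Hypothetical tool set and predictions.
tools = {"search_flights", "book_hotel"}
predictions = [
    ["search_flights", "book_hotel"],
    ["search_flights", "rent_car"],  # rent_car was never offered
]
print(split_calls(predictions[1], tools))
print(hallucination_rate(predictions, tools))
```

Dropping or flagging the hallucinated names before execution keeps an unknown function from ever reaching a live API.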

Core Entities

Models

GRANITE-20B-FUNCTIONCALLING, GRANITE-20B-CODE-INSTRUCT, Gorilla-openfunctions-v2, Meta-Llama-3-70B-Instruct, C4AI-Command-R-v01, Claude-3.5-Sonnet, GPT-4-0125-Preview, Gemini-1.5-Pro

Metrics

AST summary, Accuracy, Relevance, F1 (func name, params), LCS, Exact match, BERTScore, ROUGE-L, BLEU, Hallucination rate

Datasets

API-BLEND, SeqSGD, SeqSNIPS, SeqTopV2, SeqATIS, SeqMultiWOZ, Glaive-V2

Benchmarks

Berkeley Function Calling Leaderboard (BFCL), ToolLLM, RestGPT, API-Bank, ToolBench, ToolAlpaca, NexusRaven

Context Entities

Models

Gorilla (Patil et al.), ToolLlama, ToolAlpaca, Cohere Command-R, NexusRaven

Metrics

Function matching F1 and response generation metrics used for comparison

Datasets

RapidAPI synthetic sets (mentioned), API-Bank (evaluation subset)

Benchmarks

BFCL (used as main leaderboard)
