An open 20B model trained to spot, sequence, and call APIs reliably; it ranks 4th on the Berkeley Function Calling Leaderboard.

June 27, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, Pavan Kapanipathi

Why It Matters For Business

GRANITE-20B-FUNCTIONCALLING is an open, production-ready model for reliable API selection and response synthesis; it lowers risk from calling wrong APIs and offers a license-friendly alternative to closed models.

Summary TLDR

The authors release GRANITE-20B-FUNCTIONCALLING, a 20B-parameter, instruction-tuned code model trained with a multi-task mixture focused on seven granular function-calling tasks (name detection, parameter extraction, sequencing, chaining, parallel calls, next-best function, and response generation). Trained with QLoRA on ~142K examples drawn from API-BLEND and Glaive-V2, the model is the best open-license entry on the Berkeley Function Calling Leaderboard (BFCL) and shows strong out-of-domain generalization on several academic benchmarks. It is good at choosing which functions to call but weaker at filling in their parameters (function-name F1 0.87 vs. argument F1 0.59). The model and weights are released under Apache 2.0.

Problem Statement

LLMs used as agents must reliably identify, sequence, and invoke external APIs. Existing function-calling models often fail on generalization, on handling fine-grained sub-tasks (e.g., parameter extraction, next-best-function), or are proprietary. The paper aims to build an open model that learns these granular tasks jointly and generalizes to out-of-domain benchmarks.

Main Contribution

Released GRANITE-20B-FUNCTIONCALLING, an Apache-2.0 open 20B instruction-tuned model focused on function calling.

Designed a multi-task training mixture covering seven granular function-calling tasks using API-BLEND and Glaive-V2 (~142K examples).

Used QLoRA fine-tuning (rank-8) on GRANITE-20B-CODE-INSTRUCT and trained for 3 epochs on 8x A100-80GB.

Evaluated the model zero-shot on BFCL and six out-of-domain academic benchmarks; it is the best open model on BFCL and generalizes strongly.

Key Findings

GRANITE-20B-FUNCTIONCALLING ranks 4th on BFCL overall accuracy and is the top open-license model.

Numbers: Overall Acc. 84.71 on BFCL (Table 4)

Model is especially strong at detecting which functions to call from text.

Numbers: Function name detection avg F1 = 0.74; LCS = 0.73; Exact match = 0.43 (Table 5)

Identifying arguments (parameters and values) is weaker than function-name detection.

Numbers: Full calling: func-name avg F1 = 0.87 | args avg F1 = 0.59 (Table 6)

Natural-language response quality is near the best evaluated model.

Numbers: API-Bank response: BertScore 0.68 vs Meta-Llama-3 0.69; Rouge-L 0.47 vs 0.48; BLEU 0.47 vs 0.47 (Table 7)

Hallucination rate for predicting function names is low.

Numbers: Hallucination rate < 0.1 while being top-performing on out-of-domain datasets (Figure 3)
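
To make the hallucination finding concrete, here is a minimal sketch of how such a rate could be computed. It assumes a hallucination means a predicted function name that does not appear among the functions offered in the prompt; the paper's exact definition may differ, and the function and sample names below are hypothetical.

```python
def hallucination_rate(samples):
    """Share of predicted function names that are not in the offered specs.

    `samples` is a list of (predicted_names, allowed_names) pairs.
    Assumption: a name counts as hallucinated when it is missing from
    the functions that were offered to the model for that sample.
    """
    total = 0
    hallucinated = 0
    for predicted_names, allowed_names in samples:
        allowed = set(allowed_names)
        for name in predicted_names:
            total += 1
            if name not in allowed:
                hallucinated += 1
    return hallucinated / total if total else 0.0


# Hypothetical predictions for three prompts.
samples = [
    (["get_weather", "book_flight"], ["get_weather", "book_flight", "get_time"]),
    (["send_email"], ["get_weather", "get_time"]),  # "send_email" was never offered
    (["get_time"], ["get_time"]),
]
print(hallucination_rate(samples))  # 1 of 4 predicted names is unknown: 0.25
```

Tracking this number per dataset is one way to check whether the low rate reported here holds on your own API catalog.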

Results

Accuracy

Value: 84.71

BFCL AST / Execution / Relevance

Value: AST 84.11 | Exec 86.5 | Relevance 87.08

Function name detection (avg)

Value: Func-match F1 0.74 | LCS 0.73 | Exact 0.43

Baseline: Next best open model lower by ~8% F1 (text)

Full function calling (name+args) average F1

Value: Name F1 0.87 | Args F1 0.59

Baseline: C4AI-Command-R name F1 0.88, args F1 0.62 (best on some sets)

Response generation (API-Bank)

Value: BertScore 0.68 | Rouge-L 0.47 | BLEU 0.47 (level 1 avg)

Baseline: Meta-Llama-3-70B: BertScore 0.69 | Rouge-L 0.48 | BLEU 0.47

Hallucination rate (function names)

Value: < 0.10
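
The three function-name metrics above reward different things, which helps explain why exact match (0.43) sits well below F1 (0.74). The sketch below is an illustration only: the multiset F1 and the LCS normalization (LCS length divided by the longer sequence) are assumptions, not the paper's stated formulas, and the function names are hypothetical.

```python
from collections import Counter


def name_f1(predicted, gold):
    """Order-insensitive F1 over function names (multiset overlap)."""
    if not predicted and not gold:
        return 1.0
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence (order-sensitive)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(predicted, gold):
    """LCS length over the longer sequence (normalization is an assumption)."""
    longest = max(len(predicted), len(gold))
    return lcs_length(predicted, gold) / longest if longest else 1.0


def exact_match(predicted, gold):
    """True only when names and their order match completely."""
    return list(predicted) == list(gold)


# Right functions, wrong order: F1 is perfect, LCS drops, exact match fails.
predicted = ["search_hotel", "get_weather", "book_hotel"]
gold = ["get_weather", "search_hotel", "book_hotel"]
print(name_f1(predicted, gold), lcs_ratio(predicted, gold), exact_match(predicted, gold))
```

In this example the model picks every correct function but swaps two of them, so only the order-sensitive metrics penalize it.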

Who Should Care

What To Try In 7 Days

Run the Hugging Face release of GRANITE-20B-FUNCTIONCALLING on a dev dataset and compare function-name F1 versus your current model.

Add a light validator for parameter types and required fields before executing predicted API calls.

Use the model to generate candidate function sequences and log-only execute them for a week to spot hallucinations.
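
The validator and the log-only run suggested above can be combined in a few lines. This is a minimal sketch under stated assumptions: the model output is taken to be a JSON object with `name` and `arguments` keys, and the `SPECS` table, its format, and the function names are hypothetical stand-ins for your own API catalog.

```python
import json
import logging

logger = logging.getLogger("function_call_dry_run")

# Hypothetical spec table: parameter name -> (expected Python type, required?)
SPECS = {
    "get_weather": {"city": (str, True), "days": (int, False)},
}


def validate_call(call, specs):
    """Return a list of problems; an empty list means the call looks safe."""
    name = call.get("name")
    if name not in specs:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]
    params = specs[name]
    problems = []
    for pname, (ptype, required) in params.items():
        if pname not in args:
            if required:
                problems.append(f"missing required parameter: {pname}")
        elif not isinstance(args[pname], ptype):
            got = type(args[pname]).__name__
            problems.append(f"{pname} should be {ptype.__name__}, got {got}")
    for pname in args:
        if pname not in params:
            problems.append(f"unexpected parameter: {pname}")
    return problems


def dry_run(raw_output, specs):
    """Parse, validate, and log a predicted call without executing it."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        logger.warning("unparseable call: %s", exc)
        return ["invalid JSON"]
    if not isinstance(call, dict):
        return ["call is not a JSON object"]
    problems = validate_call(call, specs)
    logger.info("call=%s problems=%s", call, problems)
    return problems


print(dry_run('{"name": "get_weather", "arguments": {"days": "3"}}', SPECS))
```

Because `dry_run` never calls the real API, a week of logged `problems` lists gives a direct count of unknown functions and malformed arguments before anything is executed.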

Agent Features

Planning

  • function chaining
  • nested function sequencing
  • next-best-function prediction

Tool Use

  • function name detection
  • parameter-value extraction
  • parallel/sequence function calls
  • JSON-formatted function invocation
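
As an illustration of JSON-formatted invocation with sequential or parallel calls, the sketch below dispatches parsed calls to local Python functions. It assumes the model emits either one JSON object or a JSON array of objects with `name` and `arguments` keys; that output shape, the registry, and the stub functions are assumptions for the example, not the model's documented format.

```python
import json


# Stub tools standing in for real APIs (hypothetical).
def get_weather(city):
    return f"Sunny in {city}"


def get_time(timezone):
    return f"12:00 in {timezone}"


REGISTRY = {"get_weather": get_weather, "get_time": get_time}


def parse_calls(text):
    """Accept a single call object or a list of call objects."""
    data = json.loads(text)
    calls = data if isinstance(data, list) else [data]
    return [(call["name"], call.get("arguments", {})) for call in calls]


def dispatch(text, registry):
    """Run each predicted call in order and collect (name, result) pairs."""
    results = []
    for name, arguments in parse_calls(text):
        func = registry.get(name)
        if func is None:
            results.append((name, "error: unknown function"))
            continue
        results.append((name, func(**arguments)))
    return results


model_output = (
    '[{"name": "get_weather", "arguments": {"city": "Boston"}},'
    ' {"name": "get_time", "arguments": {"timezone": "UTC"}}]'
)
print(dispatch(model_output, REGISTRY))
```

Looking names up in an explicit registry, rather than evaluating model output, keeps an unknown or hallucinated function from ever being executed.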

Frameworks

  • QLoRA
  • JSON function schema for calls

Is Agentic

true

Architectures

  • instruction-tuned code model (Granite family)
  • decoder-only transformer (implicit)

Optimization Features

Token Efficiency

  • 8192 token context support in base model; trimmed function specs for prompt fit
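
Trimming function specs to fit the context window might look like the sketch below. The summary only states that argument types and required/optional flags were removed; the JSON-schema-style input and the exact trimmed output shape are assumptions, and `get_weather` is a hypothetical spec.

```python
import json


def trim_spec(spec):
    """Keep the function name, description, and parameter descriptions only.

    Assumption: the input follows a JSON-schema-like layout with
    parameters.properties and parameters.required.
    """
    properties = spec.get("parameters", {}).get("properties", {})
    return {
        "name": spec["name"],
        "description": spec.get("description", ""),
        "parameters": {
            pname: pinfo.get("description", "")
            for pname, pinfo in properties.items()
        },
    }


full_spec = {
    "name": "get_weather",
    "description": "Get the forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "days": {"type": "integer", "description": "Forecast length"},
        },
        "required": ["city"],
    },
}

trimmed = trim_spec(full_spec)
print(len(json.dumps(full_spec)), "->", len(json.dumps(trimmed)), "characters")
```

The trade-off is the one listed under Limitations: the shorter prompt no longer tells the model which arguments are required or what type they take, so that check has to happen after generation.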

Infra Optimization

  • single-node multi-GPU training (8x A100-80GB)

Model Optimization

  • QLoRA (rank 8)
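
A quick calculation shows why a rank-8 adapter is cheap to train. LoRA freezes a weight matrix of shape d_out x d_in and learns a low-rank update from two small matrices of shapes d_out x r and r x d_in; QLoRA additionally keeps the frozen base weights quantized. The hidden size used below is a hypothetical value for illustration, not a figure from the paper.

```python
def lora_extra_params(d_in, d_out, rank=8):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


d = 4096  # hypothetical hidden size, not taken from the paper
added = lora_extra_params(d, d, rank=8)
frozen = full_params(d, d)
print(added, frozen, added / frozen)  # 65536 16777216 0.00390625
```

At rank 8 the adapter for a square 4096-wide matrix holds under half a percent of the matrix's own parameters, which is what makes fine-tuning a 20B model on a single 8-GPU node practical.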

System Optimization

  • trained on 8x A100-80GB with mixed precision

Training Optimization

  • multi-task mixture weighting across tasks and datasets
  • instruction tuning with task-specific prompts
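
Mixture weighting across tasks can be pictured as weighted sampling over per-task example pools. The sketch below is one simple way to do that; the weights, the toy examples, and the sampling scheme are all assumptions, since the summary does not give the paper's actual mixture proportions. The task names come from the seven tasks listed earlier.

```python
import random


def sample_mixture(task_datasets, weights, n, seed=0):
    """Draw n (task, example) pairs, picking tasks in proportion to weights."""
    rng = random.Random(seed)
    tasks = list(task_datasets)
    probs = [weights[task] for task in tasks]
    batch = []
    for _ in range(n):
        task = rng.choices(tasks, weights=probs, k=1)[0]
        batch.append((task, rng.choice(task_datasets[task])))
    return batch


# Toy pools; real training data would come from API-BLEND and Glaive-V2.
task_datasets = {
    "name_detection": ["example A1", "example A2"],
    "parameter_extraction": ["example B1", "example B2"],
    "response_generation": ["example C1"],
}
weights = {  # hypothetical weights
    "name_detection": 1.0,
    "parameter_extraction": 3.0,
    "response_generation": 0.0,  # a zero weight switches a task off
}

batch = sample_mixture(task_datasets, weights, n=1000)
print(sum(1 for task, _ in batch if task == "parameter_extraction"), "of", len(batch))
```

Raising the weight of a weaker task, such as parameter extraction, is the kind of knob this setup exposes.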

Reproducibility

License

  • Apache 2.0

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Context-length limits forced removal of argument types and required/optional flags from function specs during training and evaluation.
  • Evaluation on Java/JavaScript/REST categories shows brittleness tied to syntax rules and external API availability.
  • Parameter-value extraction lags behind function-name detection; may require post-processing for safe execution.

When Not To Use

  • When you need full typed function specs in prompt and cannot truncate signatures.
  • When live API execution depends on strict parameter typing without secondary validation.
  • If you need the absolute best argument-filling accuracy (closed models may be slightly better).

Failure Modes

  • Missing or incorrect parameter values leading to failed API calls.
  • Hallucinated function calls when encountering very different libraries than trained ones (rare but possible).
  • Brittleness for language-specific code generation (Java/JavaScript) due to syntax nuance.

Core Entities

Models

  • GRANITE-20B-FUNCTIONCALLING
  • GRANITE-20B-CODE-INSTRUCT
  • Gorilla-openfunctions-v2
  • Meta-Llama-3-70B-Instruct
  • C4AI-Command-R-v01
  • Claude-3.5-Sonnet
  • GPT-4-0125-Preview
  • Gemini-1.5-Pro

Metrics

  • AST summary
  • Accuracy
  • Relevance
  • F1 (func name, params)
  • LCS
  • Exact match
  • BERTScore
  • ROUGE-L
  • BLEU
  • Hallucination rate

Datasets

  • API-BLEND
  • SeqSGD
  • SeqSNIPS
  • SeqTopV2
  • SeqATIS
  • SeqMultiWOZ
  • Glaive-V2

Benchmarks

  • Berkeley Function Calling Leaderboard (BFCL)
  • ToolLLM
  • RestGPT
  • API-Bank
  • ToolBench
  • ToolAlpaca
  • NexusRaven

Context Entities

Models

  • Gorilla (Patil et al.)
  • ToolLlama
  • ToolAlpaca
  • Cohere Command-R
  • NexusRaven

Metrics

  • Function matching F1 and response generation metrics used for comparison

Datasets

  • RapidAPI synthetic sets (mentioned)
  • API-Bank (evaluation subset)

Benchmarks

  • BFCL (used as main leaderboard)