xLAM: open-source models (1B–141B) plus a unified function-calling data pipeline that tops the Berkeley Function-Calling leaderboard

September 5, 2024 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Why It Matters For Business

xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.

Summary TLDR

xLAM is a family of open-source agent models (1.35B to 141B effective params) trained with a unified function-calling data format, heavy augmentation, and APIGen synthesis (60k verified samples from 3,673 APIs). The series achieves state-of-the-art function-calling performance (top-1 on Berkeley Function-Calling Leaderboard v2, 87.31% accuracy) and strong results on ToolBench, Webshop, and ToolQuery. The paper’s practical claim: careful data unification, augmentation, and verified synthetic data can close the gap between open-source and proprietary agent models.

Problem Statement

Open-source agent models lag behind proprietary LLMs because agent datasets are scarce, heterogeneous in format, and noisy (hallucinated tool calls, wrong argument types, duplicated turns). This makes it hard to train models that generalize across many tool-using and multi-turn agent tasks.

Main Contribution

Release of the xLAM model series (xLAM-1b-fc-r, xLAM-7b-fc-r, xLAM-7b-r, xLAM-8x7b-r, xLAM-8x22b-r) for function-calling and general agent use.

A modular unified function-calling data format (task, tools, format instruction, few-shot, steps) to standardize agent trajectories.

A data pipeline: unification, prompt-format and instruction-following augmentation, LLM+rule-based quality checks, and APIGen synthesis producing 60k verified function-calling samples from 3,673 APIs.

Empirical results showing top performance on BFCL v2 (xLAM-8x22b-r at 87.31% overall accuracy) and strong results on ToolBench, Webshop, and ToolQuery; ablations quantify gains from augmentation and cleaning.
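The five sections of the unified format listed above can be pictured as a single record. The following is a minimal sketch, not the paper's actual schema: the field names, the `Step` layout, and the `get_weather` tool are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    # One turn of an agent trajectory (field names are illustrative).
    thought: str
    tool_calls: list
    observation: str = ""


@dataclass
class UnifiedSample:
    task: str                  # task instruction or user query
    tools: list                # definitions of the tools the agent may call
    format_instruction: str    # how the model must format its output
    few_shot: list = field(default_factory=list)
    steps: list = field(default_factory=list)


sample = UnifiedSample(
    task="What is the weather in Paris?",
    tools=[{
        "name": "get_weather",  # hypothetical tool, not from the paper
        "description": "Return current weather for a city.",
        "parameters": {"city": {"type": "string", "required": True}},
    }],
    format_instruction="Respond with a JSON list of tool calls.",
    steps=[Step(
        thought="I need the current weather for Paris.",
        tool_calls=[{"name": "get_weather", "arguments": {"city": "Paris"}}],
    )],
)

# asdict() recurses into nested dataclasses, so the record is plain JSON data.
record = asdict(sample)
serialized = json.dumps(record, indent=2)
```

Keeping each section as a separate field is what makes later steps, such as reordering sections or swapping format instructions, a matter of re-rendering the same record.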

Key Findings

Top overall accuracy on Berkeley Function-Calling Leaderboard v2.

Numbers: 87.31% overall accuracy (xLAM-8x22b-r, BFCL v2 cutoff 09/03/2024)

Smaller models remain competitive after data pipeline and synthesis.

Numbers: xLAM-1b-fc-r: 75.43% accuracy (BFCL v2 rank 32)

Data augmentation and cleaning gave measurable gains in benchmarks.

Numbers: Augmented vs raw: +2.3% ToolBench, +5.8% Webshop, +18.3% ToolQuery; cleaning added +23.4% on ToolQuery

Training on a unified format improves robustness to structured prompts.

Numbers: xLAM-8x22b-r kept ToolQuery-Unified performance while GPT-4o degraded by ~42%

Results

Accuracy (BFCL v2 overall)

Value: 87.31% (xLAM-8x22b-r)

Baseline: GPT-4-0125-preview, 85.79%

Webshop Success Rate

Value: 0.414 (xLAM-7b-r)

Baseline: GPT-4-0125-preview, 0.375

ToolQuery Success Rate

Value: 0.683 (xLAM-8x7b-r and xLAM-8x22b-r)

Baseline: Mixtral-8x22b-inst, 0.400

ToolBench Pass Rate (unseen instructions, same tool set)

Value: 0.5308 (xLAM-7b-r)

Baseline: GPT-4-0125-preview, 0.5462

Ablation: augmentation vs raw

Value: +2.3% ToolBench, +5.8% Webshop, +18.3% ToolQuery

Baseline: raw (pre-unification)

Who Should Care

What To Try In 7 Days

Run xLAM-1b-fc-r on one real function-calling task to measure latency and cost vs a hosted API.

Convert a small internal toolset to the unified function-calling JSON format and fine-tune a 7B xLAM checkpoint on that data.

Apply the paper’s prompt-format augmentation (shuffle tools and sections) to your small dataset and measure function-call validity before/after.
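The shuffle-based augmentation in the last item can be sketched as follows. This is an illustrative approximation, assuming a hypothetical bracketed section layout and two made-up tools; the paper's own prompt templates are not reproduced here.

```python
import random

# Hypothetical tools, for illustration only.
TOOLS = [
    {"name": "search_orders", "description": "Look up orders by customer id."},
    {"name": "refund_order", "description": "Issue a refund for an order id."},
]


def render_prompt(sections, order):
    """Join the named sections in the given order."""
    return "\n\n".join(f"[{name}]\n{sections[name]}" for name in order)


def augment(task, tools, format_instruction, seed):
    """Return one prompt variant with shuffled tool order and section order."""
    rng = random.Random(seed)       # seeded so each variant is reproducible
    shuffled_tools = tools[:]       # copy, so the caller's list is untouched
    rng.shuffle(shuffled_tools)
    sections = {
        "task": task,
        "tools": "\n".join(
            f"- {t['name']}: {t['description']}" for t in shuffled_tools
        ),
        "format_instruction": format_instruction,
    }
    order = list(sections)
    rng.shuffle(order)
    return render_prompt(sections, order)


variants = [
    augment("Refund order 123.", TOOLS, "Reply with JSON tool calls.", seed=s)
    for s in range(4)
]
```

Every variant carries the same information in a different layout, so a model trained on them has less opportunity to rely on a fixed position for the tool list or the instruction. Function-call validity can then be measured on the same held-out prompts before and after fine-tuning.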

Agent Features

Memory

  • long-context support (up to 64k tokens in xLAM-8x22b-r)

Planning

  • tool planning (multi-step thought fields)
  • relevance detection (align calls to query)

Tool Use

  • function calling (single/multiple/parallel)
  • argument formatting and type checking

Frameworks

  • LoRA
  • DPO (preference alignment)
  • FSDP (distributed training)
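For readers unfamiliar with DPO, the standard per-pair loss is small enough to write out. The sketch below uses the usual published formulation on scalar sequence log-probabilities; the `beta` value and the example numbers are assumptions, not settings reported for xLAM.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    pi_*  : log-probabilities under the policy being trained
    ref_* : log-probabilities under the frozen reference model
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large |x|.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy favours the chosen answer more than the reference does: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In a real training run the log-probabilities would come from batched model outputs and the loss would be averaged over pairs.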

Is Agentic

true

Architectures

  • dense (1B, 7B models)
  • MoE

Collaboration

  • n/a

Optimization Features

Infra Optimization

  • NVIDIA H100 training
  • PyTorch FSDP + HuggingFace Accelerate

Model Optimization

  • LoRA

System Optimization

  • SFT

Training Optimization

  • SFT
  • Direct Preference Optimization (DPO)
  • Cosine LR scheduler with warmup
  • Data-parallel seed diversification
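The cosine schedule with warmup listed above follows a simple closed form. A minimal sketch, assuming linear warmup and hypothetical values for step counts and peak learning rate (the summary does not state the actual hyperparameters):

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)   # clamp if called past the final step
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Illustrative values only.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-4)
            for s in range(101)]
```

In practice a library scheduler, for example `get_cosine_schedule_with_warmup` in Hugging Face Transformers, would normally be used instead of a hand-written function.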

Inference Optimization

  • smaller FC-specialized models (1B/7B) for single-GPU hosting

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Pipeline benefits are benchmarked mainly on function-calling and tool-use; general reasoning advantages are less emphasized.
  • APIGen synthetic data and model training occurred before BFCL v2 live updates, so some live-user patterns arrived after training.
  • Quality verification relies heavily on LLM judges plus sampled human checks, not large-scale human grading.

When Not To Use

  • For multimodal tasks: xLAM targets text-based function calling and tool use only.
  • When you need fully curated human-verified datasets for high-stakes safety audits.
  • If you require built-in retrieval-augmented knowledge (no RAG pipeline described).

Failure Modes

  • Undefined function name or undefined arguments in generated function_calls (detected in public datasets).
  • Incorrect argument types (string vs list) causing execution errors.
  • Argument hallucination: generated argument values not grounded in query or observations.
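The first two failure modes can be caught with rule-based checks before a call is executed. The following sketch assumes a hypothetical tool-definition layout (a `parameters` mapping with JSON-style `type` names) and a made-up `search` tool; it is not the paper's checker. Argument hallucination needs a semantic comparison against the query and observations and is not covered here.

```python
# Map JSON-style type names to Python types (assumed naming convention).
# Note: Python treats bool as a subclass of int, so True passes "integer".
PY_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "array": list, "object": dict,
}

# Hypothetical tool definition, for illustration only.
TOOLS = [{
    "name": "search",
    "parameters": {
        "keywords": {"type": "array", "required": True},
        "limit": {"type": "integer"},
    },
}]


def validate_call(call, tools):
    """Return a list of rule violations for one generated function call."""
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    if spec is None:
        return [f"undefined function: {call.get('name')}"]
    params = spec["parameters"]
    args = call.get("arguments", {})
    errors = []
    for name, value in args.items():
        if name not in params:
            errors.append(f"undefined argument: {name}")
            continue
        expected = params[name]["type"]
        # Unknown type names fall back to object, which accepts any value.
        if not isinstance(value, PY_TYPES.get(expected, object)):
            errors.append(f"wrong type for {name}: expected {expected}")
    for name, spec_param in params.items():
        if spec_param.get("required") and name not in args:
            errors.append(f"missing required argument: {name}")
    return errors
```

For example, `validate_call({"name": "search", "arguments": {"keywords": "shoes"}}, TOOLS)` flags the string-versus-list mistake described above, while the same call with `["shoes"]` returns an empty list.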

Core Entities

Models

  • xLAM-1b-fc-r (DeepSeek-Coder-1b base, 1.35B)
  • xLAM-7b-fc-r (DeepSeek-Coder-7b base, 6.91B)
  • xLAM-7b-r (Mistral-7b base, 7.24B)
  • xLAM-8x7b-r (Mixtral-8x7b base, ~46.7B)
  • xLAM-8x22b-r (Mixtral-8x22b base, ~141B)

Metrics

  • Accuracy
  • Pass Rate (ToolBench)
  • Success Rate (Webshop/ToolQuery)
  • Progress Rate (multi-turn progress)

Datasets

  • APIGen synthetic function-calling (60k samples, 3,673 APIs)
  • ToolBench (eval)
  • Berkeley Function-Calling Leaderboard v2 (BFCL v2, eval)
  • ToolQuery and ToolQuery-Unified (eval)
  • Webshop (eval)
  • DialogStudio (instruction data source)
  • Data Provenance datasets (instruction data sources)

Benchmarks

  • Berkeley Function-Calling Leaderboard v2
  • ToolBench
  • ToolQuery
  • ToolQuery-Unified
  • Webshop

Context Entities

Models

  • Mixtral-8x22b-inst (base comparison)
  • GPT-4 family (various variants, baseline)
  • AgentOhana-8x7b (comparison)

Datasets

  • ToolBench RapidAPI suite
  • Public agent datasets combined in unified format