xLAM: open-source models (1B–141B) plus a unified function-calling data pipeline that tops the Berkeley Function-Calling leaderboard

September 5, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent empirical gains from dataset unification, augmentation, and verified synthetic data across multiple public agent benchmarks; the improvements are practical rather than architectural.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Links

Abstract / PDF / Code

Why It Matters For Business

xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.

Summary TLDR

xLAM is a family of open-source agent models (1.35B to 141B effective params) trained with a unified function-calling data format, heavy augmentation, and APIGen synthesis (60k verified samples from 3,673 APIs). The series achieves state-of-the-art function-calling performance (top-1 on Berkeley Function-Calling Leaderboard v2, 87.31% accuracy) and strong results on ToolBench, Webshop, and ToolQuery. The paper’s practical claim: careful data unification, augmentation, and verified synthetic data can close the gap between open-source and proprietary agent models.

Problem Statement

Open-source agent models lag behind proprietary LLMs because agent datasets are scarce, heterogeneous in format, and noisy (hallucinated tool calls, wrong argument types, duplicated turns). This makes it hard to train models that generalize across many tool-using and multi-turn agent tasks.

Main Contribution

Release of the xLAM model series (xLAM-1b-fc-r, xLAM-7b-fc-r, xLAM-7b-r, xLAM-8x7b-r, xLAM-8x22b-r) for function-calling and general agent use.

A modular unified function-calling data format (task, tools, format instruction, few-shot, steps) to standardize agent trajectories.
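As a rough illustration of the unified format, the sketch below builds one record containing the five modules listed above (task, tools, format instruction, few-shot, steps). The key names (`task_instruction`, `tools`, `format_instruction`, `few_shot_examples`, `steps`), the JSON layout, and the `get_weather` tool are assumptions made for this example, not the paper's exact schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class UnifiedRecord:
    """One agent trajectory in a unified function-calling layout (hypothetical schema)."""
    task_instruction: str
    tools: List[Dict[str, Any]]
    format_instruction: str
    few_shot_examples: List[Dict[str, Any]] = field(default_factory=list)
    steps: List[Dict[str, Any]] = field(default_factory=list)


record = UnifiedRecord(
    task_instruction="Answer the user's question, calling tools when needed.",
    tools=[{
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Return current weather for a city.",
        "parameters": {"city": {"type": "string", "required": True}},
    }],
    format_instruction="Reply with a JSON object holding a 'tool_calls' list.",
    steps=[{
        "query": "What is the weather in Paris?",
        "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    }],
)

# Serialize to JSON so every source dataset ends up in the same on-disk shape.
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Keeping each module in its own field is what makes later steps, such as reordering sections or swapping tool lists, a matter of simple dictionary manipulation rather than string surgery.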

Key Findings

Top overall accuracy on Berkeley Function-Calling Leaderboard v2.

Numbers: 87.31% overall accuracy (xLAM-8x22b-r, BFCL v2 cutoff 09/03/2024)

Practical Use: Use xLAM-8x22b-r when you need the best open-source function-calling performance on the evaluated benchmarks.

Evidence Ref: Table 5

Smaller models remain competitive after data pipeline and synthesis.

Numbers: xLAM-1b-fc-r: 75.43% accuracy (BFCL v2 rank 32)

Practical Use: Deploy the 1B model for budget or on-device function-calling tasks while keeping strong accuracy.

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 87.31% (xLAM-8x22b-r) | GPT-4-0125-preview 85.79% | +1.52 pp | Berkeley Function-Calling Leaderboard v2 (cutoff 09/03/2024) | Table 5: xLAM-8x22b-r ranked #1 at 87.31% overall | Table 5
Webshop Success Rate | 0.414 (xLAM-7b-r) | GPT-4-0125-preview 0.375 | +0.039 | Webshop | Table 2: xLAM-7b-r highest Success Rate | Table 2

What To Try In 7 Days

Run xLAM-1b-fc-r on one real function-calling task to measure latency and cost vs a hosted API.

Convert a small internal toolset to the unified function-calling JSON format and fine-tune a 7B xLAM checkpoint on that data.

Apply the paper’s prompt-format augmentation (shuffle tools and sections) to your small dataset and measure function-call validity before/after.
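The prompt-format augmentation in the last item can be sketched as follows: for each sample, shuffle the order of the tool list and the order of the prompt sections, then render the prompt. The section names, the bracketed headers, and the `augment_prompt` helper are hypothetical choices for this sketch; the paper's actual templates may differ.

```python
import json
import random
from typing import Any, Dict

# Hypothetical section names; adapt to whatever your own prompt template uses.
SECTION_ORDER = ["task_instruction", "tools", "format_instruction"]


def augment_prompt(record: Dict[str, Any], rng: random.Random) -> str:
    """Render a prompt with the tool list and the section order randomly shuffled."""
    tools = list(record["tools"])  # copy so the source record is not mutated
    rng.shuffle(tools)

    sections = {
        "task_instruction": "[TASK]\n" + record["task_instruction"],
        "tools": "[TOOLS]\n" + "\n".join(json.dumps(t, sort_keys=True) for t in tools),
        "format_instruction": "[FORMAT]\n" + record["format_instruction"],
    }
    order = list(SECTION_ORDER)
    rng.shuffle(order)
    return "\n\n".join(sections[name] for name in order)


example = {
    "task_instruction": "Help the user manage calendar events.",
    "tools": [
        {"name": "create_event", "parameters": {"title": "string"}},
        {"name": "delete_event", "parameters": {"event_id": "string"}},
        {"name": "list_events", "parameters": {}},
    ],
    "format_instruction": "Reply with a JSON list of tool calls.",
}

rng = random.Random(0)  # fixed seed so the augmented set is reproducible
variants = [augment_prompt(example, rng) for _ in range(4)]
for variant in variants:
    print(variant, end="\n---\n")
```

Each variant carries the same information in a different layout, so the target tool calls stay unchanged. Measuring function-call validity before and after requires a separate checker run over the model's outputs.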

Agent Features

Memory
long-context support (up to 64k tokens in xLAM-8x22b-r)
Planning
tool planning (multi-step thought fields); relevance detection (align calls to query)
Tool Use
function calling (single/multiple/parallel); argument formatting and type checking
Frameworks
LoRA; DPO (preference alignment); FSDP (distributed training)
Is Agentic

Yes

Architectures
dense (1B, 7B models); MoE
Collaboration
n/a

Optimization Features

Infra Optimization
NVIDIA H100 training; PyTorch FSDP + HuggingFace Accelerate
Model Optimization
LoRA
System Optimization
SFT
Training Optimization
SFT; Direct Preference Optimization (DPO); Cosine LR scheduler with warmup; Data-parallel seed diversification
Inference Optimization
smaller FC-specialized models (1B/7B) for single-GPU hosting

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Pipeline benefits are benchmarked mainly on function-calling and tool-use; general reasoning advantages are less emphasized.

APIGen data synthesis and model training were completed before the BFCL v2 live updates, so some live-user query patterns were not seen during training.

When Not To Use

For multimodal tasks—xLAM targets text-based function calling and tool use only.

When you need fully curated human-verified datasets for high-stakes safety audits.

Failure Modes

Undefined function name or undefined arguments in generated function_calls (detected in public datasets).

Incorrect argument types (string vs list) causing execution errors.
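A lightweight check for the failure modes above might look like the sketch below, which validates one generated call against the declared tool schemas. The schema layout (`parameters` mapping argument names to a `type` and an optional `required` flag), the `validate_call` helper, and the `search_products` tool are assumptions for illustration, not the paper's verification code.

```python
from typing import Any, Dict, List

# Map JSON-schema-style type names to Python types (assumed convention).
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}


def validate_call(call: Dict[str, Any], tools: List[Dict[str, Any]]) -> List[str]:
    """Return the problems found in one function call; an empty list means valid."""
    schemas = {t["name"]: t.get("parameters", {}) for t in tools}
    name = call.get("name")
    if name not in schemas:
        return [f"undefined function: {name!r}"]

    errors = []
    params = schemas[name]
    args = call.get("arguments", {})
    for arg, value in args.items():
        if arg not in params:
            errors.append(f"undefined argument: {arg!r}")
            continue
        type_name = params[arg].get("type")
        expected = TYPE_MAP.get(type_name)
        if expected is None:
            continue  # unknown or missing type: skip the type check
        # bool is a subclass of int in Python, so reject it explicitly.
        bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            errors.append(f"wrong type for {arg!r}: expected {type_name}, "
                          f"got {type(value).__name__}")
    for arg, spec in params.items():
        if spec.get("required") and arg not in args:
            errors.append(f"missing required argument: {arg!r}")
    return errors


tools = [{
    "name": "search_products",  # hypothetical tool
    "parameters": {
        "keywords": {"type": "array", "required": True},
        "limit": {"type": "integer"},
    },
}]

good = {"name": "search_products", "arguments": {"keywords": ["red", "shoes"], "limit": 5}}
bad_name = {"name": "find_products", "arguments": {}}
bad_type = {"name": "search_products", "arguments": {"keywords": "red shoes"}}
bad_arg = {"name": "search_products", "arguments": {"keywords": ["a"], "colour": "red"}}

for call in (good, bad_name, bad_type, bad_arg):
    print(call["name"], validate_call(call, tools))
```

Running a check like this over training data before fine-tuning, and over model outputs at inference time, catches the string-versus-list class of errors before a call is ever executed.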

Core Entities

Models

xLAM-1b-fc-r (DeepSeek-Coder-1b base, 1.35B)
xLAM-7b-fc-r (DeepSeek-Coder-7b base, 6.91B)
xLAM-7b-r (Mistral-7b base, 7.24B)
xLAM-8x7b-r (Mixtral-8x7b base, ~46.7B)
xLAM-8x22b-r (Mixtral-8x22b base, ~141B)

Metrics

Accuracy
Pass Rate (ToolBench)
Success Rate (Webshop/ToolQuery)
Progress Rate (multi-turn progress)

Datasets

APIGen synthetic function-calling (60k samples, 3,673 APIs)
ToolBench (eval)
Berkeley Function-Calling Leaderboard v2 (BFCL v2, eval)
ToolQuery and ToolQuery-Unified (eval)
Webshop (eval)
DialogStudio (instruction data source)
Data Provenance datasets (instruction data sources)

Benchmarks

Berkeley Function-Calling Leaderboard v2
ToolBench
ToolQuery
ToolQuery-Unified
Webshop

Context Entities

Models

Mixtral-8x22b-inst (base comparison)
GPT-4 family (various variants, baseline)
AgentOhana-8x7b (comparison)

Datasets

ToolBench RapidAPI suite
Public agent datasets combined in unified format