xLAM: open-source models (1B–141B) plus a unified function-calling data pipeline that tops the Berkeley Function-Calling leaderboard

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent empirical gains from dataset unification, augmentation, and verified synthetic data across multiple public agent benchmarks; the improvements are practical rather than architectural.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Links

Abstract / PDF / Code

Why It Matters For Business

xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.

Who Should Care

ML Engineer, Product Manager, Founder, CTO

Summary TLDR

xLAM is a family of open-source agent models (1.35B to 141B effective params) trained with a unified function-calling data format, heavy augmentation, and APIGen synthesis (60k verified samples from 3,673 APIs). The series achieves state-of-the-art function-calling performance (top-1 on Berkeley Function-Calling Leaderboard v2, 87.31% accuracy) and strong results on ToolBench, Webshop, and ToolQuery. The paper’s practical claim: careful data unification, augmentation, and verified synthetic data can close the gap between open-source and proprietary agent models.

Problem Statement

Open-source agent models lag behind proprietary LLMs because agent datasets are scarce, heterogeneous in format, and noisy (hallucinated tool calls, wrong argument types, duplicated turns). This makes it hard to train models that generalize across many tool-using and multi-turn agent tasks.

Main Contribution

Release of the xLAM model series (xLAM-1b-fc-r, xLAM-7b-fc-r, xLAM-7b-r, xLAM-8x7b-r, xLAM-8x22b-r) for function-calling and general agent use.

A modular unified function-calling data format (task, tools, format instruction, few-shot, steps) to standardize agent trajectories.
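
As a rough illustration of the unified format, the five modules named above (task, tools, format instruction, few-shot, steps) could be held in one record and serialized to JSON. The field names, the `ToolSpec` and `Step` classes, and the example weather tool are assumptions made for this sketch, not the schema published with xLAM.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class ToolSpec:
    # Hypothetical tool description: name, summary, parameter name -> type name.
    name: str
    description: str
    parameters: dict[str, str]


@dataclass
class Step:
    # One turn of a trajectory: a thought, the tool calls made, the observation.
    thought: str
    tool_calls: list[dict[str, Any]]
    observation: str = ""


@dataclass
class UnifiedSample:
    # The five modules listed in the text; key names are illustrative.
    task_instruction: str
    tools: list[ToolSpec]
    format_instruction: str
    few_shot_examples: list[str] = field(default_factory=list)
    steps: list[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses.
        return json.dumps(asdict(self), indent=2)


sample = UnifiedSample(
    task_instruction="Answer the user's question by calling the available tools.",
    tools=[ToolSpec("get_weather", "Current weather for a city", {"city": "str"})],
    format_instruction='Reply with JSON: {"tool_calls": [...]}',
    steps=[
        Step(
            thought="The user wants the weather in Paris.",
            tool_calls=[{"name": "get_weather", "arguments": {"city": "Paris"}}],
        )
    ],
)
print(sample.to_json())
```

Keeping each module in its own field is what makes later per-module processing (reordering, rephrasing, filtering) straightforward, since every source dataset is reduced to the same keys.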

Key Findings

Top overall accuracy on Berkeley Function-Calling Leaderboard v2.

Numbers: 87.31% overall accuracy (xLAM-8x22b-r, BFCL v2 cutoff 09/03/2024)

Practical Use: Use xLAM-8x22b-r when you need the best open-source function-calling performance on the evaluated benchmarks.

Evidence Ref: Table 5

Smaller models remain competitive after data pipeline and synthesis.

Numbers: xLAM-1b-fc-r reaches 75.43% accuracy (BFCL v2 rank 32)

Practical Use: Deploy the 1B model for budget or on-device function-calling tasks while keeping strong accuracy.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	87.31% (xLAM-8x22b-r)	GPT-4-0125-preview 85.79%	+1.52 pp	Berkeley Function-Calling Leaderboard v2 (cutoff 09/03/2024)	Table 5: xLAM-8x22b-r ranked #1 at 87.31% overall	Table 5
Webshop Success Rate	0.414 (xLAM-7b-r)	GPT-4-0125-preview 0.375	+0.039	Webshop	Table 2: xLAM-7b-r highest Success Rate	Table 2

What To Try In 7 Days

Run xLAM-1b-fc-r on one real function-calling task to measure latency and cost vs a hosted API.

Convert a small internal toolset to the unified function-calling JSON format and fine-tune a 7B xLAM checkpoint on that data.

Apply the paper’s prompt-format augmentation (shuffle tools and sections) to your small dataset and measure function-call validity before/after.
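
The prompt-format augmentation in the last step can be sketched as follows, assuming each sample is a dict with a `tools` list plus named prompt sections. The section names and the `augment` and `render` helpers are illustrative assumptions; the paper's own augmentation code may differ.

```python
import random

# Hypothetical section names; a real dataset may use different keys.
SECTIONS = ["task_instruction", "tools", "format_instruction"]


def augment(sample: dict, rng: random.Random) -> dict:
    """Return a copy of `sample` with tool order and section order shuffled."""
    tools = list(sample["tools"])  # copy so the original sample is untouched
    rng.shuffle(tools)             # the model should not rely on tool position
    order = list(SECTIONS)
    rng.shuffle(order)             # nor on the order of prompt sections
    return {**sample, "tools": tools, "section_order": order}


def render(sample: dict) -> str:
    """Concatenate the sections into one prompt, in the sample's section order."""
    parts = []
    for name in sample.get("section_order", SECTIONS):
        value = sample[name]
        if name == "tools":
            value = "\n".join(f"- {t['name']}: {t['description']}" for t in value)
        parts.append(f"[{name}]\n{value}")
    return "\n\n".join(parts)


sample = {
    "task_instruction": "Help the user with travel planning.",
    "format_instruction": "Respond with a JSON list of tool calls.",
    "tools": [
        {"name": "get_weather", "description": "Weather for a city"},
        {"name": "search_flights", "description": "Find flights"},
        {"name": "book_hotel", "description": "Reserve a room"},
    ],
}
rng = random.Random(0)  # fixed seed so the variants are repeatable
variants = [augment(sample, rng) for _ in range(4)]
print(render(variants[0]))
```

Each variant carries the same tools and instructions as the original, so the target tool calls stay valid; only the presentation changes.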

Agent Features

Memory

long-context support (up to 64k tokens in xLAM-8x22b-r)

Planning

tool planning (multi-step thought fields)

relevance detection (align calls to query)

Tool Use

function calling (single/multiple/parallel)

argument formatting and type checking

Frameworks

LoRA

DPO (preference alignment)

FSDP (distributed training)

Is Agentic

Yes

Architectures

dense (1B, 7B models)

MoE

Collaboration

n/a

Optimization Features

Infra Optimization

NVIDIA H100 training

PyTorch FSDP + HuggingFace Accelerate

Model Optimization

LoRA

System Optimization

SFT

Training Optimization

SFT

Direct Preference Optimization (DPO)

Cosine LR scheduler with warmup

Data-parallel seed diversification

Inference Optimization

smaller FC-specialized models (1B/7B) for single-GPU hosting

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/SalesforceAIResearch/xLAM

https://huggingface.co/Salesforce/xLAM-models

Risks & Boundaries

Limitations

Pipeline benefits are benchmarked mainly on function-calling and tool-use; general reasoning advantages are less emphasized.

APIGen data synthesis and model training were completed before the BFCL v2 live updates, so the models never saw some live-user patterns during training.

When Not To Use

For multimodal tasks: xLAM targets text-based function calling and tool use only.

When you need fully curated human-verified datasets for high-stakes safety audits.

Failure Modes

Undefined function name or undefined arguments in generated function_calls (detected in public datasets).

Incorrect argument types (string vs list) causing execution errors.
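
The two failure modes above can be caught before execution by checking each generated call against the declared tools. This is a minimal sketch: the `TOOLS` registry, the call shape (`name` plus `arguments`), and the `validate_call` helper are assumptions for illustration, not part of the xLAM release.

```python
from typing import Any

# Hypothetical tool registry: tool name -> {argument name: expected Python type}.
TOOLS: dict[str, dict[str, type]] = {
    "search_products": {"query": str, "tags": list},
}


def validate_call(call: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one generated function call."""
    name = call.get("name")
    if name not in TOOLS:
        return [f"undefined function: {name!r}"]
    schema = TOOLS[name]
    errors = []
    for arg, value in call.get("arguments", {}).items():
        if arg not in schema:
            errors.append(f"undefined argument: {arg!r}")
        elif not isinstance(value, schema[arg]):
            errors.append(
                f"wrong type for {arg!r}: expected {schema[arg].__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


calls = [
    {"name": "search_products", "arguments": {"query": "red shoes", "tags": ["sale"]}},
    {"name": "search_products", "arguments": {"query": "red shoes", "tags": "sale"}},
    {"name": "find_items", "arguments": {"query": "red shoes"}},
]
for call in calls:
    print(call["name"], validate_call(call) or "ok")
```

The sketch checks only names and types; it does not flag missing required arguments, which a fuller validator would also need to cover. The share of calls that come back with an empty error list is one simple way to track function-call validity across model or data changes.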

Core Entities

Models

xLAM-1b-fc-r (DeepSeek-Coder-1b base, 1.35B)

xLAM-7b-fc-r (DeepSeek-Coder-7b base, 6.91B)

xLAM-7b-r (Mistral-7b base, 7.24B)

xLAM-8x7b-r (Mixtral-8x7b base, ~46.7B)

xLAM-8x22b-r (Mixtral-8x22b base, ~141B)

Metrics

Accuracy

Pass Rate (ToolBench)

Success Rate (Webshop/ToolQuery)

Progress Rate (multi-turn progress)

Datasets

APIGen synthetic function-calling (60k samples, 3,673 APIs)

ToolBench (eval)

Berkeley Function-Calling Leaderboard v2 (BFCL v2, eval)

ToolQuery and ToolQuery-Unified (eval)

Webshop (eval)

DialogStudio (instruction data source)

Data Provenance datasets (instruction data sources)

Benchmarks

Berkeley Function-Calling Leaderboard v2

ToolBench

ToolQuery

ToolQuery-Unified

Webshop

Context Entities

Models

Mixtral-8x22b-inst (base comparison)

GPT-4 family (various variants, baseline)

AgentOhana-8x7b (comparison)

Datasets

ToolBench RapidAPI suite

Public agent datasets combined in unified format
