Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.
Summary TLDR
xLAM is a family of open-source agent models (1.35B to 141B effective parameters) trained with a unified function-calling data format, heavy augmentation, and APIGen synthesis (60k verified samples from 3,673 APIs). The series achieves state-of-the-art function-calling performance (first place on the Berkeley Function-Calling Leaderboard v2 with 87.31% overall accuracy) and strong results on ToolBench, Webshop, and ToolQuery. The paper’s practical claim is that careful data unification, augmentation, and verified synthetic data can close the gap between open-source and proprietary agent models.
Problem Statement
Open-source agent models lag behind proprietary LLMs because agent datasets are scarce, heterogeneous in format, and noisy (hallucinated tool calls, wrong argument types, duplicated turns). This makes it hard to train models that generalize across many tool-using and multi-turn agent tasks.
Main Contribution
Release of the xLAM model series (xLAM-1b-fc-r, xLAM-7b-fc-r, xLAM-7b-r, xLAM-8x7b-r, xLAM-8x22b-r) for function-calling and general agent use.
A modular unified function-calling data format (task, tools, format instruction, few-shot, steps) to standardize agent trajectories.
A data pipeline: unification, prompt-format and instruction-following augmentation, LLM+rule-based quality checks, and APIGen synthesis producing 60k verified function-calling samples from 3,673 APIs.
Empirical results showing top performance on BFCL v2 (xLAM-8x22b-r at 87.31% overall accuracy) and strong wins on ToolBench, Webshop, and ToolQuery; ablations quantify gains from augmentation and cleaning.
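To make the unified format concrete, the five modules listed above can be pictured as one record per trajectory. This is only an illustrative sketch: the class name `UnifiedSample`, the helper `render_prompt`, the bracketed section headers, the separate `query` argument, and the example tool are all assumptions made here, not the schema shipped with xLAM.

```python
import json
from dataclasses import dataclass, field


@dataclass
class UnifiedSample:
    """One agent trajectory; field names mirror the five modules (hypothetical schema)."""
    task: str                                      # task instruction
    tools: list                                    # tool specs: name, description, parameters
    format_instruction: str                        # required output format
    few_shot: list = field(default_factory=list)   # optional worked examples
    steps: list = field(default_factory=list)      # earlier turns: thought, calls, observation


def render_prompt(sample: UnifiedSample, query: str) -> str:
    """Join the non-empty modules into one prompt; the [TAG] headers are made up."""
    sections = [
        ("TASK", sample.task),
        ("TOOLS", json.dumps(sample.tools, indent=2)),
        ("FORMAT", sample.format_instruction),
    ]
    if sample.few_shot:
        sections.append(("EXAMPLES", json.dumps(sample.few_shot, indent=2)))
    if sample.steps:
        sections.append(("STEPS", json.dumps(sample.steps, indent=2)))
    sections.append(("QUERY", query))
    return "\n\n".join(f"[{tag}]\n{body}" for tag, body in sections)


# Hypothetical single-tool example.
sample = UnifiedSample(
    task="Answer the user by calling the available tools.",
    tools=[{"name": "get_weather", "description": "Current weather for a city",
            "parameters": {"city": {"type": "string", "required": True}}}],
    format_instruction='Reply with a JSON list of {"name", "arguments"} objects.',
)
prompt = render_prompt(sample, "What is the weather in Paris?")
```

Keeping each module as its own field is what makes it cheap to reorder, drop, or rephrase sections later without touching the underlying data.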
Key Findings
Top overall accuracy on Berkeley Function-Calling Leaderboard v2.
Smaller models remain competitive after data pipeline and synthesis.
Data augmentation and cleaning yield measurable benchmark gains in the ablations.
Training on a unified format improves robustness to structured prompts.
Results
Accuracy
Webshop Success Rate
ToolQuery Success Rate
ToolBench Pass Rate (unseen instructions & same tool set)
Ablation: augmentation vs raw
Who Should Care
What To Try In 7 Days
Run xLAM-1b-fc-r on one real function-calling task to measure latency and cost vs a hosted API.
Convert a small internal toolset to the unified function-calling JSON format and fine-tune a 7B xLAM checkpoint on that data.
Apply the paper’s prompt-format augmentation (shuffle tools and sections) to your small dataset and measure function-call validity before/after.
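For the augmentation experiment, one way to shuffle tools and sections is sketched below. It assumes each training prompt is available as named sections plus a tool list. The function name `augment_prompt`, the section names, and the bracketed headers are hypothetical, and the paper's exact augmentation procedure may differ.

```python
import json
import random


def augment_prompt(sections: dict, tools: list, seed: int) -> str:
    """Return one prompt variant with tool order and section order shuffled."""
    rng = random.Random(seed)           # seeded so each variant is reproducible
    shuffled_tools = list(tools)        # copy: the caller's list is not mutated
    rng.shuffle(shuffled_tools)
    blocks = dict(sections)
    blocks["tools"] = json.dumps(shuffled_tools)
    order = list(blocks)
    rng.shuffle(order)
    return "\n\n".join(f"[{name.upper()}]\n{blocks[name]}" for name in order)


# Hypothetical inputs: two text sections and three tool stubs.
sections = {"task": "Call tools to answer the query.",
            "format": "Reply with a JSON list of calls."}
tools = [{"name": "search"}, {"name": "get_weather"}, {"name": "book_flight"}]
variants = [augment_prompt(sections, tools, seed=s) for s in range(4)]
```

Every variant carries the same information in a different layout, so comparing function-call validity on the raw prompts against the variants gives a rough read on how sensitive a model is to prompt structure.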
Agent Features
Memory
- long-context support (up to 64k tokens in xLAM-8x22b-r)
Planning
- tool planning (multi-step thought fields)
- relevance detection (align calls to query)
Tool Use
- function calling (single/multiple/parallel)
- argument formatting and type checking
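As an illustration of single, multiple, and parallel calls, the sketch below dispatches a model's output against a local registry of Python callables. It assumes the model emits either one JSON object or a JSON list of objects with `name` and `arguments` keys. That wire format, the function `run_tool_calls`, and the two toy tools are assumptions for this example, not the documented xLAM output format.

```python
import json


def run_tool_calls(model_output: str, registry: dict) -> list:
    """Parse the model output and execute every requested call in order."""
    calls = json.loads(model_output)
    if isinstance(calls, dict):         # a single call arrives as one object
        calls = [calls]
    results = []
    for call in calls:
        fn = registry[call["name"]]     # KeyError here means an undefined function
        results.append(fn(**call.get("arguments", {})))
    return results


# Hypothetical registry with two toy tools.
registry = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}
output = ('[{"name": "add", "arguments": {"a": 1, "b": 2}},'
          ' {"name": "upper", "arguments": {"text": "hi"}}]')
results = run_tool_calls(output, registry)
```

The calls run sequentially here for simplicity. Calls that do not depend on each other could be handed to a thread pool instead.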
Frameworks
- LoRA
- DPO (preference alignment)
- FSDP (distributed training)
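For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. This is a minimal sketch of the published DPO loss, not xLAM's training code. The inputs are assumed to be summed sequence log-probabilities under the policy and a frozen reference model, and `beta=0.1` is an illustrative default rather than a value reported for xLAM.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the preferred response.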
Is Agentic
true
Architectures
- dense (1B, 7B models)
- MoE
Collaboration
- n/a
Optimization Features
Infra Optimization
- NVIDIA H100 training
- PyTorch FSDP + HuggingFace Accelerate
Model Optimization
- LoRA
System Optimization
- SFT
Training Optimization
- SFT
- Direct Preference Optimization (DPO)
- Cosine LR scheduler with warmup
- Data-parallel seed diversification
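The cosine schedule with warmup listed above can be sketched as a plain function of the step count. The linear warmup shape, the function name `cosine_lr`, and every hyperparameter value used here are assumptions for illustration. The summary does not state the values used to train xLAM.

```python
import math


def cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)       # hold at min_lr once training is over
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, 10 of them warmup, peak learning rate 1e-3.
schedule = [cosine_lr(s, 100, 10, 1e-3) for s in range(101)]
```

In practice the same curve is usually obtained from the training framework's built-in scheduler. The explicit form is mainly useful for checking what a given configuration will do.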
Inference Optimization
- smaller FC-specialized models (1B/7B) for single-GPU hosting
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Pipeline benefits are benchmarked mainly on function-calling and tool-use; general reasoning advantages are less emphasized.
- APIGen data synthesis and model training were completed before the BFCL v2 live updates, so the training data does not cover some of the newer live-user query patterns.
- Quality verification relies heavily on LLM judges plus sampled human checks, not large-scale human grading.
When Not To Use
- For multimodal tasks: xLAM targets text-based function calling and tool use only.
- When you need fully curated human-verified datasets for high-stakes safety audits.
- If you require built-in retrieval-augmented knowledge (no RAG pipeline described).
Failure Modes
- Undefined function name or undefined arguments in generated function_calls (detected in public datasets).
- Incorrect argument types (string vs list) causing execution errors.
- Argument hallucination: generated argument values not grounded in query or observations.
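The first two failure modes can be caught with rule-based checks before a call is executed or a sample enters a training set. The sketch below assumes a simple tool spec in which `parameters` maps each argument name to a dict with `type` and an optional `required` flag. That layout and the names `validate_call`, `ungrounded_args`, and `TYPE_MAP` are hypothetical, and the grounding check is a crude substring heuristic rather than the paper's verification method.

```python
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}


def validate_call(call: dict, tools: list) -> list:
    """Return a list of problems found; an empty list means the call looks valid."""
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    if spec is None:
        return [f"undefined function: {call.get('name')!r}"]
    params = spec.get("parameters", {})
    args = call.get("arguments", {})
    errors = []
    for arg, value in args.items():
        if arg not in params:
            errors.append(f"undefined argument: {arg!r}")
            continue
        type_name = params[arg].get("type")
        expected = TYPE_MAP.get(type_name)
        # bool is a subclass of int in Python, so reject it for non-boolean parameters
        bad_bool = isinstance(value, bool) and type_name != "boolean"
        if expected is not None and (bad_bool or not isinstance(value, expected)):
            errors.append(f"wrong type for {arg!r}: expected {type_name}")
    for arg, param in params.items():
        if param.get("required") and arg not in args:
            errors.append(f"missing required argument: {arg!r}")
    return errors


def ungrounded_args(call: dict, context: str) -> list:
    """Crude hallucination heuristic: string values that never appear in the context."""
    text = context.lower()
    return [arg for arg, value in call.get("arguments", {}).items()
            if isinstance(value, str) and value.lower() not in text]
```

For example, with a `get_weather` tool that takes a string `city`, a call passing `["Paris"]` as the city is flagged as a wrong type, and a call for `"Berlin"` when the query only mentions Paris is flagged as ungrounded.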
Core Entities
Models
- xLAM-1b-fc-r (DeepSeek-Coder-1b base, 1.35B)
- xLAM-7b-fc-r (DeepSeek-Coder-7b base, 6.91B)
- xLAM-7b-r (Mistral-7b base, 7.24B)
- xLAM-8x7b-r (Mixtral-8x7b base, ~46.7B)
- xLAM-8x22b-r (Mixtral-8x22b base, ~141B)
Metrics
- Accuracy
- Pass Rate (ToolBench)
- Success Rate (Webshop/ToolQuery)
- Progress Rate (multi-turn progress)
Datasets
- APIGen synthetic function-calling (60k samples, 3,673 APIs)
- ToolBench (eval)
- Berkeley Function-Calling Leaderboard v2 (BFCL v2, eval)
- ToolQuery and ToolQuery-Unified (eval)
- Webshop (eval)
- DialogStudio (instruction data source)
- Data Provenance datasets (instruction data sources)
Benchmarks
- Berkeley Function-Calling Leaderboard v2
- ToolBench
- ToolQuery
- ToolQuery-Unified
- Webshop
Context Entities
Models
- Mixtral-8x22b-inst (base comparison)
- GPT-4 family (various variants, baseline)
- AgentOhana-8x7b (comparison)
Datasets
- ToolBench RapidAPI suite
- Public agent datasets combined in unified format

