Overview
The paper shows clear zero-shot gains across task-oriented dialogue (TOD) and function-calling benchmarks, backed by ablations and human checks. Compute cost and training on a single model family limit generality.
Citations: 0
Evidence Strength: 0.90
Confidence: 0.87
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
A single fine-tuned model can both sustain multi-turn dialogue and call diverse APIs, which cuts maintenance and prompt-engineering costs when integrating new services.
Summary TLDR
The authors introduce CoALM, a family of Llama-based models fine-tuned on CoALM-IT, a mixed dataset that interleaves task-oriented dialogue (TOD) data, single- and multi-turn function-calling data, and GPT-4o-generated multi-turn ReAct-style examples. The CoALM models (8B, 70B, 405B) retain multi-turn state tracking while also achieving top function-calling scores on API-Bank and BFCL V3 in zero-shot evaluations. Ablations show that each dataset component matters: dropping the language agent (LA) function-calling data collapses API performance, dropping the dialogue state tracking (DST) data hurts joint goal accuracy (JGA), and dropping the ReAct conversational data damages multi-turn success. The authors release code, weights, and data to support replication.
Problem Statement
TOD models handle multi-turn state and task success but fail on diverse API/function calling. Language Agents (LAs) call APIs well but struggle to maintain multi-turn user intent. The field needs a single, practical model that does both without costly per-service retraining.
Main Contribution
CoALM-IT: a unified instruction-tuning dataset that mixes dialogue state tracking (DST), function-calling examples from Hammer/ToolACE, and GPT-4o-generated multi-turn ReAct conversations (CRA).
CoALM model series (8B, 70B, 405B) trained via multitask fine-tuning (LoRA / QLoRA) to combine TOD and agentic skills.
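The idea of interleaving several data sources into one instruction-tuning set can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function name `mix_sources`, the record fields, the source keys, and the toy examples are all assumptions made for this sketch, and the summary does not state mixing ratios or ordering.

```python
import random


def mix_sources(sources, seed=0):
    """Tag each example with its source name and shuffle everything into one list.

    `sources` maps a (hypothetical) source name to a list of
    {"instruction": ..., "output": ...} records.
    """
    mixed = [
        {**example, "source": name}
        for name, examples in sources.items()
        for example in examples
    ]
    random.Random(seed).shuffle(mixed)  # seeded so the mix is reproducible
    return mixed


# Toy records standing in for the three dataset components (illustrative only).
toy_sources = {
    "dst": [
        {"instruction": "Track the dialogue state.", "output": "hotel-area=north"}
    ],
    "function_calling": [
        {"instruction": "Call the weather API.", "output": "get_weather(city='Paris')"}
    ],
    "cra": [
        {"instruction": "Continue the conversation.", "output": "Thought, Action, Observation"}
    ],
}

mixed = mix_sources(toy_sources, seed=42)
```

Keeping a `source` tag on every record makes it easy to rerun the ablations described in the summary by filtering out one component before training.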
Key Findings
CoALM models achieve high API-Bank L-1 function-calling accuracy.
CoALM improves MultiWOZ multi-turn state tracking over base instruction models: CoALM 70B reaches 43.8 JGA versus 26.3 for Llama 3.1 8B Instruct (MultiWOZ 2.4, zero-shot).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MultiWOZ Success Rate | 69.4 (CoALM 70B) | 19.9 (Llama 3.1 8B Instruct) | +49.5 pts | MultiWOZ 2.4 test set (zero-shot) | Table 2 shows CoALM 70B Success = 69.4 and Llama 3.1 8B Instruct = 19.9 | Table 2 |
| MultiWOZ Joint Goal Accuracy (JGA) | 43.8 (CoALM 70B) | 26.3 (Llama 3.1 8B Instruct) | +17.5 pts | MultiWOZ 2.4 test set (zero-shot) | Table 2 shows CoALM 70B JGA = 43.8 and Llama 3.1 8B Instruct = 26.3 | Table 2 |
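For readers unfamiliar with JGA, the sketch below shows the standard definition of joint goal accuracy: a turn counts as correct only if the full predicted dialogue state matches the gold state exactly. The slot names and values are made up for illustration, and this is not the paper's evaluation script.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same number of turns")
    if not gold_states:
        return 0.0
    exact = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return exact / len(gold_states)


# Hypothetical two-turn dialogue with MultiWOZ-style slot names.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot fails the whole turn
]

jga = joint_goal_accuracy(pred, gold)  # 1 of 2 turns matches exactly
```

Because a single wrong slot invalidates the turn, JGA is a strict metric, which is one reason absolute values sit well below the Success Rate figures.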
What To Try In 7 Days
Run CoALM 8B or 70B zero-shot on your API schemas to validate function-calling accuracy quickly.
Fine-tune an 8B Llama family model with a small CoALM-IT slice (DST + a few API examples) using LoRA and test MultiWOZ-style scenarios.
Add a small GPT-generated ReAct sample set for your top 20 APIs and spot-check via human validation to reduce hallucinations.
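To make the first suggestion concrete, one possible way to score predicted calls against expected calls for your own API schemas is sketched below. Exact match on function name and arguments is an assumption chosen for simplicity; it is not the scoring used by API-Bank or BFCL V3, and the function names and example calls are hypothetical.

```python
import json


def normalize_call(call):
    """Canonical form: function name plus arguments serialized with sorted keys."""
    return call["name"], json.dumps(call.get("arguments", {}), sort_keys=True)


def call_accuracy(predicted, expected):
    """Share of test cases where the predicted call matches the expected call exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(
        normalize_call(p) == normalize_call(e) for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


# Hypothetical test cases for two in-house APIs.
expected_calls = [
    {"name": "book_hotel", "arguments": {"city": "Cambridge", "nights": 2}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
predicted_calls = [
    # Key order differs from the expected call but still counts as a match.
    {"name": "book_hotel", "arguments": {"nights": 2, "city": "Cambridge"}},
    # Wrong argument value, so this case is a miss.
    {"name": "get_weather", "arguments": {"city": "London"}},
]

accuracy = call_accuracy(predicted_calls, expected_calls)
```

In practice `predicted_calls` would come from parsing the model's output; a few dozen hand-written cases per API are usually enough for a first read on where the model fails.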
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Experiments only use Llama family; unclear transfer to other model families.
CoALM-405B inference needs large GPU resources (authors cite 16 H100s), limiting accessibility.
When Not To Use
When you need a model already validated on non-Llama architectures without extra tuning.
When GPU resources are constrained and you cannot run large-model inference.
Failure Modes
Argument hallucination in API inputs stemming from synthetic training examples.
Lower multi-turn execution accuracy for very long or highly nested API workflows.
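A lightweight guard against the argument-hallucination failure mode is to check each predicted call against its declared schema and against the dialogue text before executing it. The sketch below is one assumed approach, not something described in the paper; the schema layout (`allowed` and `required` sets), the function name, and the substring-based grounding check are all illustrative choices.

```python
def find_argument_issues(call, schema, dialogue_text):
    """Return a list of human-readable issues for one predicted API call."""
    params = schema.get(call["name"])
    if params is None:
        return [f"unknown function: {call['name']}"]

    issues = []
    args = call.get("arguments", {})

    # Arguments the schema does not declare are likely hallucinated.
    for key in args:
        if key not in params["allowed"]:
            issues.append(f"unexpected argument: {key}")

    # Required arguments must be present.
    for key in sorted(params["required"]):
        if key not in args:
            issues.append(f"missing required argument: {key}")

    # Crude grounding check: string values should appear somewhere in the dialogue.
    lowered = dialogue_text.lower()
    for key, value in args.items():
        if isinstance(value, str) and value.lower() not in lowered:
            issues.append(f"ungrounded value for {key}: {value}")

    return issues


# Hypothetical schema, dialogue, and predicted call.
schema = {"book_hotel": {"allowed": {"city", "nights", "stars"}, "required": {"city"}}}
dialogue = "I need a hotel in Cambridge for two nights."
call = {
    "name": "book_hotel",
    "arguments": {"city": "Cambridge", "nights": 2, "email": "guest@example.com"},
}

issues = find_argument_issues(call, schema, dialogue)
```

The substring check will miss paraphrased values (for example "two" versus `2`), so it is better treated as a filter that routes suspicious calls to human review than as a hard gate.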

