Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
A single, fine-tuned model can both sustain multi-turn dialogue and call diverse APIs, cutting maintenance and prompt-engineering costs when integrating new services.
Summary TLDR
The authors create CoALM, a family of Llama-based models fine-tuned on CoALM-IT, a mixed dataset that interleaves task-oriented dialogue (TOD) data, single- and multi-turn function-calling data, and GPT-4o-generated multi-turn ReAct-style examples. CoALM models (8B, 70B, 405B) retain multi-turn state tracking while also achieving top function-calling scores on API-Bank and BFCL V3 in zero-shot evaluations. Ablations show that each dataset component matters: dropping the language-agent (LA) function-calling data collapses API performance, dropping the dialogue state tracking (DST) data hurts joint-goal accuracy, and dropping the ReAct conversational data damages multi-turn success. The authors release code, weights, and data to support replication.
Problem Statement
TOD models handle multi-turn state and task success but fail on diverse API/function calling. Language Agents (LAs) call APIs well but struggle to maintain multi-turn user intent. The field needs a single, practical model that does both without costly per-service retraining.
Main Contribution
CoALM-IT: a unified instruction-tuning dataset that mixes dialogue state tracking (DST), function-calling examples from Hammer/ToolACE, and GPT-4o-generated multi-turn ReAct conversations (CRA).
CoALM model series (8B, 70B, 405B) trained via multitask fine-tuning (LoRA / QLoRA) to combine TOD and agentic skills.
Detailed zero-shot evaluations on MultiWOZ 2.4 (TOD), API-Bank (tool use), and BFCL V3 (function calling), plus dataset ablations showing component importance.
Open release of code, model weights, datasets, and training configs to encourage reproducibility.
Key Findings
CoALM models achieve high API-Bank L-1 function-calling accuracy.
CoALM improves MultiWOZ multi-turn state tracking versus base instruction models.
Removing the LA component from fine-tuning severely damages API performance.
CoALM 405B achieves the highest BFCL V3 overall accuracy among the reported open models.
Human spot-checks found errors in about 9% of the GPT-4o-generated CRA examples.
Results
MultiWOZ Success Rate
Accuracy
API-Bank Rouge-L (L-1)
Accuracy
Who Should Care
What To Try In 7 Days
Run CoALM 8B or 70B zero-shot on your API schemas to validate function-calling accuracy quickly.
Fine-tune an 8B Llama-family model with LoRA on a small CoALM-IT slice (DST plus a few API examples) and test it on MultiWOZ-style scenarios.
Add a small GPT-generated ReAct sample set for your top 20 APIs and spot-check via human validation to reduce hallucinations.
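A minimal sketch of how the zero-shot check in the first step could be scored, assuming the model emits each function call as a JSON object with "name" and "arguments" keys. That output shape, the function names, and the sample data are illustrative assumptions, not the authors' evaluation code or the API-Bank/BFCL scoring rules.

```python
import json


def normalize_call(call):
    """Return a canonical (name, sorted-arguments JSON) pair for comparison."""
    return call["name"], json.dumps(call.get("arguments", {}), sort_keys=True)


def call_accuracy(predictions, references):
    """Fraction of examples where the predicted call matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for raw_pred, ref in zip(predictions, references):
        try:
            pred = json.loads(raw_pred)
            if normalize_call(pred) == normalize_call(ref):
                correct += 1
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            # Unparseable or malformed model output counts as a miss.
            pass
    return correct / len(references)


# Hypothetical raw model outputs and gold calls for two test prompts.
predictions = [
    '{"name": "book_hotel", "arguments": {"city": "Cambridge", "nights": 2}}',
    '{"name": "find_train", "arguments": {"destination": "London"}}',
]
references = [
    {"name": "book_hotel", "arguments": {"nights": 2, "city": "Cambridge"}},
    {"name": "find_train", "arguments": {"destination": "Leeds"}},
]
score = call_accuracy(predictions, references)
print(f"exact-match call accuracy: {score:.2f}")
```

Exact match is deliberately strict: argument order is ignored, but any wrong value fails the example. A looser scorer may suit APIs with free-text arguments.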
Agent Features
Memory
- Short-term multi-turn state tracking (dialogue state)
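Dialogue state tracking is commonly scored with joint-goal accuracy, the metric named in the ablation summary: a turn counts as correct only if every slot in the predicted state matches the gold state. The sketch below illustrates that definition; the slot names and values are made-up examples, not taken from the paper's data.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Share of turns whose predicted dialogue state equals the gold state exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        return 0.0
    matches = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return matches / len(gold_states)


# Hypothetical three-turn dialogue: the state accumulates slots turn by turn.
gold = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "3"},  # one wrong slot fails the whole turn
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "yes"},
]
jga = joint_goal_accuracy(pred, gold)
print(f"joint-goal accuracy: {jga:.3f}")
```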
Planning
- ReAct multi-step reasoning (Thought1/Thought2)
- Action prediction then response generation
Tool Use
- Structured function calling (JSON-like)
- Multi-turn API orchestration
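To make "structured function calling" concrete, here is a hedged sketch of checking a JSON-like call against a function schema before executing it. The schema layout, the function name book_restaurant, and the helper validate_call are assumptions made for illustration; CoALM's actual call format may differ.

```python
def validate_call(call, schema):
    """Return a list of problems found in a function call; an empty list means it passed."""
    problems = []
    if call.get("name") != schema["name"]:
        problems.append(f"unknown function: {call.get('name')!r}")
        return problems
    params = schema["parameters"]
    args = call.get("arguments", {})
    # Flag arguments the schema does not define, and values of the wrong type.
    for arg_name, value in args.items():
        if arg_name not in params:
            problems.append(f"unexpected argument: {arg_name}")
        elif not isinstance(value, params[arg_name]["type"]):
            problems.append(f"wrong type for {arg_name}")
    # Flag required arguments that the call omitted.
    for arg_name, spec in params.items():
        if spec.get("required") and arg_name not in args:
            problems.append(f"missing required argument: {arg_name}")
    return problems


# Hypothetical schema and model-produced call.
schema = {
    "name": "book_restaurant",
    "parameters": {
        "name": {"type": str, "required": True},
        "people": {"type": int, "required": True},
        "day": {"type": str, "required": False},
    },
}
call = {
    "name": "book_restaurant",
    "arguments": {"name": "Nandos", "people": "four", "cuisine": "thai"},
}
problems = validate_call(call, schema)
print(problems)
```

A check like this catches invented argument names and mistyped values before a call reaches the real API, at the cost of keeping a schema per function.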
Frameworks
- LoRA
- bitsandbytes
- Oumi training framework
Is Agentic
true
Architectures
- Llama 3.1
- Llama 3.3
Optimization Features
Infra Optimization
- Used TogetherAI and Microsoft Azure credits for scaling
Model Optimization
- LoRA
System Optimization
- Trained on 8 NVIDIA H100s (8B/70B); the 405B model required 16 H100s for inference
Training Optimization
- Mixed-precision bf16
- Global batch size 8, 3 epochs, lr=1e-4, linear warmup 0.1
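The hyperparameters above can be turned into a step count and a learning-rate curve. The sketch below assumes that "linear warmup 0.1" means a warmup over the first 10% of optimizer steps followed by linear decay to zero, which is a common reading but not confirmed by the summary. The dataset size of 8,000 examples is a placeholder.

```python
import math


def total_optimizer_steps(num_examples, global_batch_size=8, epochs=3):
    """Number of optimizer steps for the given dataset size, batch size, and epochs."""
    steps_per_epoch = math.ceil(num_examples / global_batch_size)
    return steps_per_epoch * epochs


def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to zero.

    Assumption: warmup_ratio is the fraction of total steps spent warming up.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps


# Placeholder dataset size; substitute the size of your own training slice.
total = total_optimizer_steps(num_examples=8000)
for step in (0, 150, 300, 1650, 3000):
    print(step, linear_warmup_lr(step, total))
```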
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments use only the Llama family; transfer to other model families is untested.
- CoALM-405B inference needs large GPU resources (authors cite 16 H100s), limiting accessibility.
- Possible catastrophic forgetting and effects on general reasoning were not systematically measured after fine-tuning.
- CRA data is GPT-4o generated and had a 9% error rate on human spot-checks.
When Not To Use
- If you need a model tested on non-Llama architectures without extra tuning.
- When GPU resources are constrained and you cannot run large-model inference.
- When strict, certified safety guarantees are required without further auditing.
Failure Modes
- Argument hallucination in API inputs stemming from synthetic training examples.
- Lower multi-turn execution accuracy for very long or highly nested API workflows.
- Potential drop in general reasoning tasks not covered by fine-tuning data.
Core Entities
Models
- CoALM-8B
- CoALM-70B
- CoALM-405B
- Llama 3.1 8B
- Llama 3.3 70B
- Llama 3.1 405B
- Hammer
- ToolACE
- Granite
Metrics
- Success Rate (MultiWOZ)
- Accuracy
- Rouge-L (API-Bank L-1/L-2)
Datasets
- CoALM-IT
- SNIPS (DST)
- Hammer dataset
- ToolACE dataset
- SGD/CRA (GPT-4o generated)
Benchmarks
- MultiWOZ 2.4
- API-Bank
- BFCL V3 (Berkeley Function Calling Leaderboard)

