CoALM: one fine-tuned model that combines multi-turn dialogue state tracking with robust API / function calling

February 12, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows clear zero-shot gains across TOD and function-calling benchmarks and provides ablations and human checks; compute cost and single-family training limit generality.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A single, fine-tuned model can both sustain multi-turn dialogue and call diverse APIs, cutting maintenance and prompt-engineering costs when integrating new services.

Who Should Care

Summary TLDR

The authors introduce CoALM, a family of Llama-based models fine-tuned on CoALM-IT, a mixed dataset that interleaves task-oriented dialogue (TOD) data, single- and multi-turn function-calling data, and GPT-4o-generated multi-turn ReAct-style examples. CoALM models (8B, 70B, 405B) retain multi-turn state tracking while also achieving top function-calling scores on API-Bank and BFCL V3 in zero-shot evaluations. Ablations show that each dataset component matters: dropping the language-agent (LA) function-calling data collapses API performance, dropping the dialogue state tracking (DST) data hurts joint-goal accuracy, and dropping the ReAct conversational data damages multi-turn success. The authors release code, weights, and data to support replication.

Problem Statement

TOD models handle multi-turn state and task success but fail on diverse API/function calling. Language Agents (LAs) call APIs well but struggle to maintain multi-turn user intent. The field needs a single, practical model that does both without costly per-service retraining.

Main Contribution

CoALM-IT: a unified instruction-tuning dataset that mixes dialogue state tracking (DST), function-calling examples from Hammer/ToolACE, and GPT-4o-generated multi-turn ReAct conversations (CRA).

CoALM model series (8B, 70B, 405B) trained via multitask fine-tuning (LoRA / QLoRA) to combine TOD and agentic skills.
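To make the "mixed dataset" idea concrete, here is a minimal sketch of pooling the three kinds of examples into one shuffled training set. It is an illustration only: the summary does not state the mixing ratios or ordering the authors used, so the uniform shuffle, the `mix_sources` helper, the field names, and the toy examples are all assumptions.

```python
import random
from typing import Dict, List


def mix_sources(sources: Dict[str, List[dict]], seed: int = 0) -> List[dict]:
    """Tag each example with its source name, pool everything, and shuffle.

    Uniform shuffling is an assumption; the real CoALM-IT recipe may weight
    or order its components differently.
    """
    rng = random.Random(seed)
    pooled = [
        {"source": name, **example}
        for name, examples in sources.items()
        for example in examples
    ]
    rng.shuffle(pooled)
    return pooled


# Toy stand-ins for the three components (contents are invented).
toy_sources = {
    "dst": [
        {"prompt": "User: book a table for two", "target": "restaurant-people=2"},
    ],
    "function_calling": [
        {
            "prompt": "What is the weather in Paris?",
            "target": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
        },
    ],
    "cra": [
        {"prompt": "User: find me a hotel", "target": "Thought: ... Action: ..."},
    ],
}

mixed = mix_sources(toy_sources, seed=13)
```

Keeping the `source` tag on every example makes it easy to rerun ablations like the ones the paper reports, by filtering out one component before training.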

Key Findings

CoALM models score highly on API-Bank L-1 function calling, measured with Rouge-L.

Numbers: CoALM 70B Rouge-L L-1 = 92.7 (Table 3)

Practical Use: Use CoALM-style fine-tuning to get very accurate single-step API invocation without per-API engineering.

Evidence Ref: Table 3
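For readers unfamiliar with the metric, the sketch below shows one common way to compute Rouge-L: the F-measure built from the longest common subsequence (LCS) of predicted and reference tokens. Whitespace tokenization and the balanced F1 variant are assumptions here; API-Bank's own scoring script may tokenize or weight differently, and reported scores such as 92.7 are presumably on a 0 to 100 scale.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Rouge-L as the harmonic mean of LCS precision and LCS recall (0 to 1)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards in-order token overlap, a call with the right function name but one wrong argument still earns partial credit, which is worth remembering when reading a high Rouge-L number as "accuracy".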

CoALM improves MultiWOZ multi-turn state tracking versus base instruction models.

Numbers: CoALM 70B JGA = 43.8 vs Llama 3.1 8B Instruct JGA = 26.3 (Table 2)

Practical Use: Fine-tune with DST + ReAct samples to boost joint-goal accuracy in multi-turn task bots.

Evidence Ref: Table 2
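Joint goal accuracy (JGA) is strict: a turn counts as correct only if the predicted dialogue state matches the gold state on every slot. A minimal sketch follows, assuming states are flat slot-to-value dictionaries; the slot names are toy MultiWOZ-style examples, and value normalization, which real evaluation scripts usually apply, is left out.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain one state per turn")
    if not gold:
        return 0.0
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


# Toy two-turn dialogue: the second prediction gets one slot wrong.
gold_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-people": "2"},
]
predicted_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-people": "4"},
]
jga = joint_goal_accuracy(predicted_states, gold_states)  # 0.5
```

A single wrong slot zeroes out the whole turn, which is why JGA values such as 43.8 can look low next to per-slot accuracies.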

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MultiWOZ Success Rate | 69.4 | Llama 3.1 8B Instruct: 19.9 | +49.5 pts (CoALM 70B vs Llama 3.1 8B Instruct) | MultiWOZ 2.4 test set (zero-shot) | Table 2 shows CoALM 70B Success = 69.4 and Llama 3.1 8B Instruct = 19.9 | Table 2
MultiWOZ Joint Goal Accuracy (JGA) | 43.8 | Llama 3.1 8B Instruct: 26.3 | +17.5 pts (CoALM 70B vs Llama 3.1 8B Instruct) | MultiWOZ 2.4 test set (zero-shot) | Table 2 shows CoALM 70B JGA = 43.8 and Llama 3.1 8B Instruct = 26.3 | Table 2

What To Try In 7 Days

Run CoALM 8B or 70B zero-shot on your API schemas to validate function-calling accuracy quickly.

Fine-tune an 8B Llama family model with a small CoALM-IT slice (DST + a few API examples) using LoRA and test MultiWOZ-style scenarios.

Add a small GPT-generated ReAct sample set for your top 20 APIs and spot-check via human validation to reduce hallucinations.
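Before human spot-checks, a cheap automated filter can catch the most obvious argument hallucinations. The sketch below checks a JSON function call against a hand-written schema. The schema layout (`parameters`, `required`), the `check_call` helper, and the `get_weather` API are hypothetical illustrations, not the format used by CoALM, API-Bank, or BFCL.

```python
import json
from typing import Dict, List

# Hypothetical schema registry: function name -> allowed and required arguments.
SCHEMAS: Dict[str, dict] = {
    "get_weather": {
        "parameters": {"city": "string", "unit": "string"},
        "required": ["city"],
    },
}


def check_call(raw_call: str, schemas: Dict[str, dict]) -> List[str]:
    """Return a list of problems found in a JSON function call (empty if none)."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call is not a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in schemas:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]
    schema = schemas[name]
    problems = []
    for arg in args:
        if arg not in schema["parameters"]:
            problems.append(f"hallucinated argument: {arg}")
    for arg in schema.get("required", []):
        if arg not in args:
            problems.append(f"missing required argument: {arg}")
    return problems
```

Calls that return an empty list are only structurally valid; a human reviewer still needs to judge whether the argument values match what the user asked for.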

Agent Features

Memory
Short-term multi-turn state tracking (dialogue state)
Planning
ReAct multi-step reasoning (Thought1/Thought2); action prediction then response generation
Tool Use
Structured function calling (JSON-like); multi-turn API orchestration
Frameworks
LoRA, bitsandbytes, Oumi training framework
Is Agentic

Yes

Architectures
Llama 3.1, Llama 3.3

Optimization Features

Infra Optimization
Used TogetherAI and Microsoft Azure credits for scaling
Model Optimization
LoRA
System Optimization
Trained on 8 NVIDIA H100s (8B/70B); 405B required 16 H100s for inference
Training Optimization
Mixed-precision bf16; global batch size 8, 3 epochs, lr=1e-4, linear warmup 0.1
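The hyperparameters listed above can be collected into a small config object, sketched below. Two readings are assumptions: that "linear warmup 0.1" means a warmup ratio of 10% of total steps, and that the global batch of 8 is spread evenly across the 8 GPUs. The summary does not say what the schedule does after warmup, so this sketch simply holds the peak rate. `FineTuneConfig` and `warmup_lr` are illustrative names, not part of the Oumi framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    learning_rate: float = 1e-4  # from the summary
    num_epochs: int = 3          # from the summary
    global_batch_size: int = 8   # from the summary
    warmup_ratio: float = 0.1    # assumed reading of "linear warmup 0.1"
    precision: str = "bf16"      # from the summary
    num_gpus: int = 8            # 8 H100s for the 8B/70B runs

    def per_device_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Per-GPU micro-batch that yields the global batch size."""
        per_step = self.num_gpus * grad_accum_steps
        if self.global_batch_size % per_step != 0:
            raise ValueError("global batch size is not divisible by GPUs x accumulation")
        return self.global_batch_size // per_step


def warmup_lr(step: int, total_steps: int, cfg: FineTuneConfig) -> float:
    """Linear warmup to the peak rate; held constant afterwards (assumption)."""
    warmup_steps = max(1, int(total_steps * cfg.warmup_ratio))
    if step < warmup_steps:
        return cfg.learning_rate * (step + 1) / warmup_steps
    return cfg.learning_rate


cfg = FineTuneConfig()
micro_batch = cfg.per_device_batch_size()  # 1 example per GPU per step
```

Under these assumptions each GPU processes a single example per step, which is consistent with fitting large models on 80 GB cards but leaves little room to raise throughput without gradient accumulation.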

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Experiments only use Llama family; unclear transfer to other model families.

CoALM-405B inference needs large GPU resources (authors cite 16 H100s), limiting accessibility.
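A back-of-the-envelope calculation shows why the 405B model is hard to host. The sketch below counts only the memory needed for the weights, assuming 2 bytes per parameter (bf16) and 80 GB per H100; both figures are assumptions, since the summary does not state the inference precision or GPU variant, and the KV cache and activations need additional headroom on top.

```python
import math


def min_gpus_for_weights(num_params: float, bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    weight_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


PARAMS_405B = 405e9
H100_MEM_GB = 80  # assumed 80 GB variant

weights_gb = PARAMS_405B * 2 / 1e9                             # 810 GB in bf16
bf16_gpus = min_gpus_for_weights(PARAMS_405B, 2, H100_MEM_GB)  # 11 GPUs
int4_gpus = min_gpus_for_weights(PARAMS_405B, 0.5, H100_MEM_GB)  # 3 GPUs
```

Under these assumptions the weights alone need at least 11 GPUs in bf16, so the 16 H100s the authors cite leave roughly 470 GB of headroom for the KV cache and activations. Quantized inference would shrink the footprint considerably, but the summary does not report results for that setting.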

When Not To Use

When you need a model validated on non-Llama architectures without extra tuning.

When GPU resources are constrained and you cannot run large-model inference.

Failure Modes

Argument hallucination in API inputs stemming from synthetic training examples.

Lower multi-turn execution accuracy for very long or highly nested API workflows.

Core Entities

Models

CoALM-8B, CoALM-70B, CoALM-405B, Llama 3.1 8B, Llama 3.3 70B, Llama 3.1 405B, Hammer, ToolACE, Granite

Metrics

Success Rate (MultiWOZ), Accuracy, Rouge-L (API-Bank L-1/L-2)

Datasets

CoALM-IT, SNIPS (DST), Hammer dataset, ToolACE dataset, SGD/CRA (GPT-4o generated)

Benchmarks

MultiWOZ 2.4, API-Bank, BFCL V3 (Berkeley Function Calling Leaderboard)