CoALM: one fine-tuned model that combines multi-turn dialogue state tracking with robust API / function calling

February 12, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur

Why It Matters For Business

A single, fine-tuned model can both sustain multi-turn dialogue and call diverse APIs, cutting maintenance and prompt-engineering costs when integrating new services.

Summary TLDR

The authors create CoALM, a family of Llama-based models fine-tuned on CoALM-IT, a mixed dataset that interleaves task-oriented dialogue (TOD) data, single- and multi-turn function-calling data, and GPT-4o-generated multi-turn ReAct-style examples. CoALM models (8B, 70B, 405B) retain multi-turn state tracking while also achieving top function-calling scores on API-Bank and BFCL V3 in zero-shot evaluations. Ablations show that each dataset component matters: dropping the language-agent (LA) function-calling data collapses API performance, dropping the dialogue state tracking (DST) data hurts joint goal accuracy, and dropping the ReAct conversational data damages multi-turn success. The authors release code, weights, and data to support replication.

Problem Statement

TOD models handle multi-turn state and task success but fail on diverse API/function calling. Language Agents (LAs) call APIs well but struggle to maintain multi-turn user intent. The field needs a single, practical model that does both without costly per-service retraining.

Main Contribution

CoALM-IT: a unified instruction-tuning dataset that mixes dialogue state tracking (DST), function-calling examples from Hammer/ToolACE, and GPT-4o-generated multi-turn ReAct conversations (CRA).

CoALM model series (8B, 70B, 405B) trained via multitask fine-tuning (LoRA / QLoRA) to combine TOD and agentic skills.

Detailed zero-shot evaluations on MultiWOZ 2.4 (TOD), API-Bank (tool use), and BFCL V3 (function calling), plus dataset ablations showing component importance.

Open release of code, model weights, datasets, and training configs to encourage reproducibility.

Key Findings

CoALM models achieve high API-Bank L-1 function-calling accuracy.

Numbers: CoALM 70B Rouge-L L-1 = 92.7 (Table 3)

CoALM improves MultiWOZ multi-turn state tracking versus base instruction models.

Numbers: CoALM 70B JGA = 43.8 vs Llama 3.1 8B Instruct JGA = 26.3 (Table 2)

Removing the LA component from fine-tuning severely damages API performance.

Numbers: API-Bank Rouge-L1 drops 47.3% when LA data is removed (Table 5)

CoALM 405B achieves the highest BFCL V3 overall accuracy among reported open models.

Numbers: CoALM 405B overall acc = 63.34% vs GPT-4o = 59.83% (Table 4)

Human validation found errors in GPT-4o-generated CRA examples, but the error rate is low.

Numbers: 9% error rate on 100 sampled CRA dialogues (Appendix D)

Results

MultiWOZ Success Rate

Value: 69.4

Baseline: Llama 3.1 8B Instruct 19.9

MultiWOZ Joint Goal Accuracy (JGA)

Value: 43.8

Baseline: Llama 3.1 8B Instruct 26.3

API-Bank Rouge-L (L-1)

Value: 92.7

Baseline: Llama 3.1 8B Instruct 72.7

BFCL V3 Overall Accuracy

Value: 63.34

Baseline: GPT-4o 59.83

Who Should Care

What To Try In 7 Days

Run CoALM 8B or 70B zero-shot on your API schemas to validate function-calling accuracy quickly.

Fine-tune an 8B Llama family model with a small CoALM-IT slice (DST + a few API examples) using LoRA and test MultiWOZ-style scenarios.

Add a small GPT-generated ReAct sample set for your top 20 APIs and spot-check via human validation to reduce hallucinations.
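
The spot-check step above can be sketched as follows. This is a minimal illustration, not the authors' validation procedure: the function names `sample_for_review` and `wilson_interval` are hypothetical, and the Wilson score interval is a standard statistical choice swapped in here to show how much uncertainty a 100-dialogue sample leaves around an observed error rate such as the 9% reported for CRA.

```python
import math
import random


def sample_for_review(dialogue_ids, k=100, seed=0):
    """Pick up to k dialogue ids uniformly at random for human review."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    ids = list(dialogue_ids)
    return rng.sample(ids, min(k, len(ids)))


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an observed error proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 9 flawed dialogues found among 100 reviewed.
low, high = wilson_interval(errors=9, n=100)
print(f"observed 9.0%, approximate 95% interval: {low:.1%} to {high:.1%}")
```

With only 100 reviewed dialogues the interval is wide, so a larger review sample may be worthwhile before trusting a synthetic set for production fine-tuning.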

Agent Features

Memory

  • Short-term multi-turn state tracking (dialogue state)

Planning

  • ReAct multi-step reasoning (Thought1/Thought2)
  • Action prediction then response generation
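
The two-phase flow above (predict an action, then generate a response once the API result is available) might be handled at runtime roughly as sketched below. The field labels `Thought1`, `Action`, `Thought2`, and `Response` and the line-per-field layout are assumptions made for illustration; the actual CoALM prompt and output template may differ, and `parse_react` and `next_step` are hypothetical helper names.

```python
import re

# Assumed layout: one "Label: text" field per line. Not the confirmed CoALM format.
FIELD = re.compile(r"^(Thought1|Action|Thought2|Response):\s*(.*)$")


def parse_react(text):
    """Split a ReAct-style completion into a dict of labeled fields."""
    fields = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = FIELD.match(stripped)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current and stripped:
            # Continuation line: append to the most recent field.
            fields[current] += " " + stripped
    return fields


def next_step(fields):
    """Decide what the runtime should do with a parsed completion."""
    if "Action" in fields and "Response" not in fields:
        return ("call_api", fields["Action"])  # phase 1: execute the tool call
    if "Response" in fields:
        return ("reply", fields["Response"])   # phase 2: answer the user
    return ("retry", None)                     # nothing usable was produced


first = parse_react("Thought1: The user wants a table.\nAction: book_table(city='Paris')")
print(next_step(first))
```

Keeping the parse and the dispatch decision separate makes it easier to log malformed completions without touching the tool-execution path.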

Tool Use

  • Structured function calling (JSON-like)
  • Multi-turn API orchestration
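
Before executing a structured call emitted by a model, it can help to validate it against the known tool schema, which also catches hallucinated arguments. The sketch below assumes a JSON object of the form `{"name": ..., "arguments": {...}}`; that shape, the `SCHEMAS` table, and the `find_restaurant` tool are all hypothetical and not taken from the CoALM release.

```python
import json

# Hypothetical tool registry: parameter name -> (expected Python type, required?)
SCHEMAS = {
    "find_restaurant": {"city": (str, True), "party_size": (int, False)},
}


def validate_call(raw, schemas=SCHEMAS):
    """Return (ok, problems) for a JSON-encoded function call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return False, ["call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in schemas:
        return False, [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, ["arguments must be a JSON object"]

    spec = schemas[name]
    problems = []
    for param, (expected_type, required) in spec.items():
        if param not in args:
            if required:
                problems.append(f"missing required argument: {param}")
        elif not isinstance(args[param], expected_type):
            problems.append(f"wrong type for argument: {param}")
    for param in args:
        if param not in spec:
            # Arguments the schema does not define are likely hallucinated.
            problems.append(f"unexpected argument: {param}")
    return not problems, problems


print(validate_call('{"name": "find_restaurant", "arguments": {"city": "Paris"}}'))
```

Rejected calls can be fed back to the model as an error observation rather than executed, which keeps malformed arguments away from real services.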

Frameworks

  • LoRA
  • bitsandbytes
  • Oumi training framework

Is Agentic

true

Architectures

  • Llama 3.1
  • Llama 3.3

Optimization Features

Infra Optimization

  • Used TogetherAI and Microsoft Azure credits for scaling

Model Optimization

  • LoRA

System Optimization

  • Trained on 8 NVIDIA H100s (8B/70B); 405B required 16 H100s for inference

Training Optimization

  • Mixed-precision bf16
  • Global batch size 8, 3 epochs, lr=1e-4, linear warmup 0.1
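
To see what these hyperparameters imply for a run, the arithmetic can be sketched as below. The batch size, epoch count, and learning rate come from the list above; reading "linear warmup 0.1" as a warmup ratio of 10% of total steps is an assumption, the dataset size is a placeholder argument, and holding the rate flat after warmup is a simplification because the decay schedule is not stated here.

```python
import math

# Values reported in the list above.
GLOBAL_BATCH_SIZE = 8
EPOCHS = 3
PEAK_LR = 1e-4
WARMUP_RATIO = 0.1  # assumption: "linear warmup 0.1" means 10% of total steps


def schedule(num_examples):
    """Return (total optimizer steps, warmup steps) for a dataset size."""
    steps_per_epoch = math.ceil(num_examples / GLOBAL_BATCH_SIZE)
    total_steps = steps_per_epoch * EPOCHS
    warmup_steps = round(total_steps * WARMUP_RATIO)
    return total_steps, warmup_steps


def lr_at(step, warmup_steps):
    """Linear warmup to PEAK_LR, then flat (post-warmup decay is not modeled)."""
    if warmup_steps == 0 or step >= warmup_steps:
        return PEAK_LR
    return PEAK_LR * (step + 1) / warmup_steps


# Placeholder dataset size, not the real size of CoALM-IT.
total, warmup = schedule(num_examples=8000)
print(total, warmup, lr_at(0, warmup), lr_at(warmup, warmup))
```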

Reproducibility

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments only use Llama family; unclear transfer to other model families.
  • CoALM-405B inference needs large GPU resources (authors cite 16 H100s), limiting accessibility.
  • Catastrophic forgetting and effects on general reasoning were not systematically measured after fine-tuning.
  • CRA data is GPT-4o generated and had a 9% error rate on human spot-checks.

When Not To Use

  • When you need a model validated on non-Llama architectures without extra tuning.
  • When GPU resources are constrained and you cannot run large-model inference.
  • When strict, certified safety guarantees are required without further auditing.

Failure Modes

  • Argument hallucination in API inputs stemming from synthetic training examples.
  • Lower multi-turn execution accuracy for very long or highly nested API workflows.
  • Potential drop in general reasoning tasks not covered by fine-tuning data.

Core Entities

Models

  • CoALM-8B
  • CoALM-70B
  • CoALM-405B
  • Llama 3.1 8B
  • Llama 3.3 70B
  • Llama 3.1 405B
  • Hammer
  • ToolACE
  • Granite

Metrics

  • Success Rate (MultiWOZ)
  • Joint Goal Accuracy (JGA, MultiWOZ)
  • Overall accuracy (BFCL V3)
  • Rouge-L (API-Bank L1/L2)
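
For readers unfamiliar with joint goal accuracy, the metric used for MultiWOZ state tracking, the sketch below shows the usual definition: a turn counts as correct only when every slot value in the predicted dialogue state matches the gold state. The slot names are illustrative, and this is a simplified version that skips the value normalization real evaluation scripts typically apply.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted slot-value dict equals the gold dict exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


# Illustrative two-turn dialogue: the second turn has one wrong slot value,
# so the whole turn is counted as incorrect.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

Because a single wrong slot fails the whole turn, JGA is much stricter than per-slot accuracy, which helps explain why the reported values sit well below the success-rate numbers.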

Datasets

  • CoALM-IT
  • SNIPS (DST)
  • Hammer dataset
  • ToolACE dataset
  • SGD/CRA (GPT-4o generated)

Benchmarks

  • MultiWOZ 2.4
  • API-Bank
  • BFCL V3 (Berkeley Function Calling Leaderboard)