Turn CoT into a chat: let chat LLMs call tools step-by-step to improve math and multi-hop QA

May 23, 20238 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.5

Citation Count

3

Authors

Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, Ji-Rong Wen

Links

Abstract / PDF

Why It Matters For Business

If you deploy chat LLMs for complex reasoning, organizing the session as a multi-turn chat that can call calculators, equation solvers, or retrievers reduces end-to-end errors and integrates tools without heavy engineering.

Summary TLDR

ChatCoT reframes chain-of-thought (CoT) as a multi-turn chat so a chat-based LLM (ChatGPT) can call external tools (calculator, equation solver, retriever) at intermediate steps. The authors initialize an in-chat knowledge memory (tool descriptions, retrieved exemplars, multi-turn reasoning format) and iterate a tool-augmented reasoning step: reason → pick tool → run tool → use result. On MATH and HotpotQA, ChatCoT improves accuracy over standard CoT and prior iterative prompting, while keeping token and runtime costs similar to other prompting methods. Code and data are published.

Problem Statement

LLMs are good at many tasks but struggle on complex multi-step reasoning that needs specific functions (arithmetic, equation solving, document retrieval). Existing ways of mixing tools with CoT either pre-plan tool use (no correction after mistakes) or interrupt generation frequently (break continuity). The paper asks: can we unify CoT and tool use as a natural multi-turn chat so a chat LLM can invoke tools step-by-step and keep coherent reasoning?

Main Contribution

A simple framework (ChatCoT) that models CoT as a multi-turn conversation so chat LLMs can interleave natural-language reasoning and tool calls.

A conversational knowledge memory seeded at early turns: tool descriptions, retrieved similar exemplars, and multi-turn reasoning exemplars.

An iterative tool-augmented reasoning step (reason → choose tool → execute → integrate) with optional feedback rounds when tool outputs are unsatisfactory.

Empirical results on MATH and HotpotQA showing consistent accuracy gains versus strong CoT baselines and an ablation study isolating each memory component.

Key Findings

ChatCoT improves average MATH accuracy over the prior SOTA iterative method (PHP).

NumbersMATH Avg: ChatCoT 39.4 vs PHP 36.5 (7.9% relative)

ChatCoT greatly raises HotpotQA dev accuracy versus vanilla CoT.

NumbersHotpotQA: ChatCoT 59.2 vs CoT 38.0 (21.2 points abs)

ChatCoT increases both how often the model uses tools and how often tool calls succeed.

NumbersNumber Theory task: frequency 70.0%, success 92.0% (ChatCoT)

Each conversational memory piece matters; removing retrieval or format drops accuracy.

NumbersAblation (Precalculus): full 23.8 → w/o RATK 20.0 (−3.8 abs)

ChatCoT's token generation (proxy for cost) is same order as other CoT methods.

NumbersGenerated tokens: ChatCoT 355.2 vs CoT 224.6 and CoT w/ Tool 296.2

Results

Accuracy

Value39.4%

BaselinePHP 36.5%

Accuracy

Value59.2%

BaselineCoT 38.0%

tool call frequency (Number Theory)

Value70.0%

BaselineCoT w/ Tool 3.0%

tool success rate (Number Theory)

Value92.0%

BaselineCoT w/ Tool 85.7%

average generated tokens

Value355.2

BaselineCoT 224.6

Who Should Care

What To Try In 7 Days

Seed a chat prompt with short tool descriptions and one multi-turn exemplar.

Add 2–5 retrieved similar examples (use a cheap embedding retriever) to the early chat context.

Implement a loop: ask model to reason, ask it which tool to use, run tool, feed result back, repeat until answer.

Agent Features

Memory

  • conversational knowledge memory (tool descriptions, retrieved exemplars, multi-turn format)

Planning

  • no upfront fixed tool plan; per-turn tool selection

Tool Use

  • calculator
  • equation solver
  • retriever

Frameworks

  • ChatCoT iterative step (reason → select tool → execute → integrate)

Is Agentic

true

Architectures

  • chat-based LLM multi-turn prompting
  • retrieval-augmented exemplars

Collaboration

  • model ↔ tool calls coordinated by a lightweight agent (prompted rules)

Optimization Features

Token Efficiency

  • same-order token use as CoT (modest increase)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations use ChatGPT (gpt-3.5-turbo); GPT-4 was not tested.
  • Designed for chat-style LLMs; compatibility with non-chat LLMs is not demonstrated.
  • Current experiments focus on math and HotpotQA; generality to other reasoning types is untested.
  • Tool errors and retriever noise can mislead the chat and require feedback loops.

When Not To Use

  • Simple tasks that do not need external tools or multi-step reasoning.
  • Environments where you cannot run or trust tool APIs.
  • Non-chat LLM deployments that cannot maintain multi-turn context.

Failure Modes

  • Tool returns error or wrong result and model trusts it without adequate checks.
  • Retriever provides irrelevant exemplars and the model is distracted.
  • Model continues chatting past the answer (over-chatting) unless forced to stop.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • GPT-3
  • PaLM
  • Minerva
  • PaLM 2
  • LLaMA
  • Galactica

Metrics

  • Accuracy
  • generated tokens
  • tool frequency
  • tool success rate

Datasets

  • MATH
  • HotpotQA

Benchmarks

  • MATH (seven subcategories)
  • HotpotQA distractor setting

Context Entities

Models

  • ChatGPT used as backbone implementation

Metrics

  • Accuracy
  • token counts

Datasets

  • MATH train/dev/test splits
  • HotpotQA distractor dev