Turn CoT into a chat: let chat LLMs call tools step-by-step to improve math and multi-hop QA

Overview

Decision SnapshotNeeds Validation

ChatCoT is a low-risk prompting pattern you can prototype quickly with chat LLMs and a few tool APIs; evidence comes from experiments on two public benchmarks showing consistent gains but limited scope beyond math and multi-hop QA.

Citations3

Evidence Strength0.70

Confidence0.85

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 45%

Authors

Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy chat LLMs for complex reasoning, organizing the session as a multi-turn chat that can call calculators, equation solvers, or retrievers reduces end-to-end errors and integrates tools without heavy engineering.

Who Should Care

Product Manager ML Engineer Data Scientist Engineering Lead CTO

Summary TLDR

ChatCoT reframes chain-of-thought (CoT) as a multi-turn chat so a chat-based LLM (ChatGPT) can call external tools (calculator, equation solver, retriever) at intermediate steps. The authors initialize an in-chat knowledge memory (tool descriptions, retrieved exemplars, multi-turn reasoning format) and iterate a tool-augmented reasoning step: reason → pick tool → run tool → use result. On MATH and HotpotQA, ChatCoT improves accuracy over standard CoT and prior iterative prompting, while keeping token and runtime costs similar to other prompting methods. Code and data are published.

Problem Statement

LLMs are good at many tasks but struggle on complex multi-step reasoning that needs specific functions (arithmetic, equation solving, document retrieval). Existing ways of mixing tools with CoT either pre-plan tool use (no correction after mistakes) or interrupt generation frequently (break continuity). The paper asks: can we unify CoT and tool use as a natural multi-turn chat so a chat LLM can invoke tools step-by-step and keep coherent reasoning?

Main Contribution

A simple framework (ChatCoT) that models CoT as a multi-turn conversation so chat LLMs can interleave natural-language reasoning and tool calls.

A conversational knowledge memory seeded at early turns: tool descriptions, retrieved similar exemplars, and multi-turn reasoning exemplars.

Key Findings

ChatCoT improves average MATH accuracy over the prior SOTA iterative method (PHP).

NumbersMATH Avg: ChatCoT 39.4 vs PHP 36.5 (7.9% relative)

Practical UseIf you use ChatGPT for math problems, structuring prompts as a multi-turn chat with tool calls can yield measurably better accuracy than single-pass iterative prompting.

Evidence RefTable 2

ChatCoT greatly raises HotpotQA dev accuracy versus vanilla CoT.

NumbersHotpotQA: ChatCoT 59.2 vs CoT 38.0 (21.2 points abs)

Practical UseFor multi-hop QA, letting the model retrieve and request retrieval feedback inside a chat can convert many previously failed cases into correct answers.

Evidence RefTable 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	39.4%	PHP 36.5%	+2.9 points (7.9% relative)	MATH (avg over seven categories)	Table 2 shows ChatCoT 39.4 vs PHP 36.5	Table 2
Accuracy	59.2%	CoT 38.0%	+21.2 points	HotpotQA (distractor dev)	Table 3 lists ChatCoT 59.2 vs CoT 38	Table 3

What To Try In 7 Days

Seed a chat prompt with short tool descriptions and one multi-turn exemplar.

Add 2–5 retrieved similar examples (use a cheap embedding retriever) to the early chat context.

Implement a loop: ask model to reason, ask it which tool to use, run tool, feed result back, repeat until answer.

Agent Features

Memory

conversational knowledge memory (tool descriptions, retrieved exemplars, multi-turn format)

Planning

no upfront fixed tool plan; per-turn tool selection

Tool Use

calculatorequation solverretriever

Frameworks

ChatCoT iterative step (reason → select tool → execute → integrate)

Is Agentic

Yes

Architectures

chat-based LLM multi-turn promptingretrieval-augmented exemplars

Collaboration

model ↔ tool calls coordinated by a lightweight agent (prompted rules)

Optimization Features

Token Efficiency

same-order token use as CoT (modest increase)

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/RUCAIBOX/ChatCoT

Data URLs

https://github.com/RUCAIBOX/ChatCoT (paper assets); MATH and HotpotQA public datasets

Risks & Boundaries

Limitations

Evaluations use ChatGPT (gpt-3.5-turbo); GPT-4 was not tested.

Designed for chat-style LLMs; compatibility with non-chat LLMs is not demonstrated.

When Not To Use

Simple tasks that do not need external tools or multi-step reasoning.

Environments where you cannot run or trust tool APIs.

Failure Modes

Tool returns error or wrong result and model trusts it without adequate checks.

Retriever provides irrelevant exemplars and the model is distracted.

Core Entities

Models

ChatGPT (gpt-3.5-turbo)GPT-3PaLMMinervaPaLM 2LLaMAGalactica

Metrics

Accuracygenerated tokenstool frequencytool success rate

Datasets

MATHHotpotQA

Benchmarks

MATH (seven subcategories)HotpotQA distractor setting

Context Entities

Models

ChatGPT used as backbone implementation

Metrics

Accuracytoken counts

Datasets

MATH train/dev/test splitsHotpotQA distractor dev

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

ChatCoT improves average MATH accuracy over the prior SOTA iterative method (PHP).

ChatCoT greatly raises HotpotQA dev accuracy versus vanilla CoT.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Metrics

Datasets

You May Also Want to Read

Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

Key finding

ETAPP: an 800-case sandbox benchmark and key-point LLM evaluator for personalized tool use

Key finding

TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

Key finding

ToolBH: a multi-level benchmark that finds tool-use hallucinations in LLMs

Key finding

Let two agents use different retrieval tools and iteratively query the web to cut hallucinations in fact-checking

Key finding