Overview
ChatCoT is a low-risk prompting pattern you can prototype quickly with chat LLMs and a few tool APIs; evidence comes from experiments on two public benchmarks showing consistent gains but limited scope beyond math and multi-hop QA.
Citations3
Evidence Strength0.70
Confidence0.85
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
If you deploy chat LLMs for complex reasoning, organizing the session as a multi-turn chat that can call calculators, equation solvers, or retrievers reduces end-to-end errors and integrates tools without heavy engineering.
Who Should Care
Summary TLDR
ChatCoT reframes chain-of-thought (CoT) as a multi-turn chat so a chat-based LLM (ChatGPT) can call external tools (calculator, equation solver, retriever) at intermediate steps. The authors initialize an in-chat knowledge memory (tool descriptions, retrieved exemplars, multi-turn reasoning format) and iterate a tool-augmented reasoning step: reason → pick tool → run tool → use result. On MATH and HotpotQA, ChatCoT improves accuracy over standard CoT and prior iterative prompting, while keeping token and runtime costs similar to other prompting methods. Code and data are published.
Problem Statement
LLMs are good at many tasks but struggle on complex multi-step reasoning that needs specific functions (arithmetic, equation solving, document retrieval). Existing ways of mixing tools with CoT either pre-plan tool use (no correction after mistakes) or interrupt generation frequently (break continuity). The paper asks: can we unify CoT and tool use as a natural multi-turn chat so a chat LLM can invoke tools step-by-step and keep coherent reasoning?
Main Contribution
A simple framework (ChatCoT) that models CoT as a multi-turn conversation so chat LLMs can interleave natural-language reasoning and tool calls.
A conversational knowledge memory seeded at early turns: tool descriptions, retrieved similar exemplars, and multi-turn reasoning exemplars.
Key Findings
ChatCoT improves average MATH accuracy over the prior SOTA iterative method (PHP).
ChatCoT greatly raises HotpotQA dev accuracy versus vanilla CoT.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 39.4% | PHP 36.5% | +2.9 points (7.9% relative) | MATH (avg over seven categories) | Table 2 shows ChatCoT 39.4 vs PHP 36.5 | Table 2 |
| Accuracy | 59.2% | CoT 38.0% | +21.2 points | HotpotQA (distractor dev) | Table 3 lists ChatCoT 59.2 vs CoT 38 | Table 3 |
What To Try In 7 Days
Seed a chat prompt with short tool descriptions and one multi-turn exemplar.
Add 2–5 retrieved similar examples (use a cheap embedding retriever) to the early chat context.
Implement a loop: ask model to reason, ask it which tool to use, run tool, feed result back, repeat until answer.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Reproducibility
Code URLs
Risks & Boundaries
Limitations
Evaluations use ChatGPT (gpt-3.5-turbo); GPT-4 was not tested.
Designed for chat-style LLMs; compatibility with non-chat LLMs is not demonstrated.
When Not To Use
Simple tasks that do not need external tools or multi-step reasoning.
Environments where you cannot run or trust tool APIs.
Failure Modes
Tool returns error or wrong result and model trusts it without adequate checks.
Retriever provides irrelevant exemplars and the model is distracted.

