Train LLMs to plan with abstract placeholders, then fill them with tools to reason faster and more accurately

January 30, 2024 · 9 min

Overview

Decision Snapshot: Ready For Pilot

CoA requires moderate fine-tuning and a tool-execution pipeline but yields reproducible gains on math and wiki QA; evidence is solid within evaluated benchmarks but limited to the two domains and full-model fine-tuning setups.

Citations: 5

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, Tianlu Wang

Links

Abstract / PDF

Why It Matters For Business

CoA makes multi-step tool use both more accurate and faster by separating plan generation from tool calls; this reduces arithmetic bugs and shortens latency when pipelines must call external APIs.

Who Should Care

Summary TLDR

CoA (Chain-of-Abstraction) fine-tunes LLMs to first generate multi-step reasoning traces that use abstract placeholders, then calls domain tools once to fill (reify) those placeholders. This separates planning from fetching concrete facts or calculations, improving accuracy on math and Wikipedia QA, reducing arithmetic errors to zero in human evaluation, and speeding up end-to-end inference by ~1.3–1.5× on evaluated tasks.
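The reification step for a math trace can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bracketed `[expression = yN]` trace syntax, the function names `evaluate` and `reify`, and the restriction to `+ - * /` are all assumptions made for the example.

```python
import ast
import operator
import re

# Assumed trace format: each tool call is written as "[expression = yN]".
STEP = re.compile(r"\[([^\[\]=]+)=\s*(y\d+)\s*\]")
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr, env):
    """Evaluate +, -, *, / over numbers and already-filled placeholders."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in env:
            return env[node.id]
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)

def reify(trace):
    """Fill every placeholder in one left-to-right pass over the finished trace."""
    env = {}
    def fill(match):
        expr, name = match.group(1).strip(), match.group(2)
        env[name] = evaluate(expr, env)
        return f"[{expr} = {env[name]}]"
    filled = STEP.sub(fill, trace)
    # Replace remaining uses of each placeholder with its computed value.
    for name, value in env.items():
        filled = re.sub(rf"\b{name}\b", str(value), filled)
    return filled, env

trace = "Total time is [20 + 35 = y1] minutes; twice that is [y1 * 2 = y2]."
filled, values = reify(trace)
print(filled)
```

Because the whole trace exists before any tool runs, the placeholders can be resolved in a single pass rather than pausing generation at each step.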

Problem Statement

Tool-augmented LLMs often call APIs interleaved with generation. Interleaving (1) forces the model to plan and compute at the same time, hurting multi-step planning and robustness, and (2) incurs repeated waiting for tool responses, slowing inference. The paper aims to make multi-step tool use both more accurate and faster.

Main Contribution

Chain-of-Abstraction (CoA): fine-tune LLMs to output abstract multi-step reasoning traces with placeholders, then call tools to fill placeholders and produce answers.

A data construction pipeline that rewrites gold answers into CoA traces using LLaMa-70B and validates rewrites with domain tools (equation solver or Wiki search + NER).
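The validation half of that pipeline can be pictured as a filter that keeps a rewritten trace only when the tool reproduces the gold answer. The sketch below covers the math case only and rests on assumptions: the tuple representation of steps, the `solve_chain` stand-in for the equation solver, and the tolerance `tol` are hypothetical, not taken from the paper.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def solve_chain(steps):
    """Hypothetical stand-in for the equation solver.

    Each step is (left, op, right, placeholder); operands are numbers
    or the names of placeholders defined by earlier steps.
    """
    env = {}
    for left, op, right, name in steps:
        a = env[left] if isinstance(left, str) else left
        b = env[right] if isinstance(right, str) else right
        env[name] = OPS[op](a, b)
    return env

def keep_rewrite(steps, gold_answer, tol=1e-6):
    """Accept a rewrite only if its last placeholder matches the gold answer."""
    if not steps:
        return False
    try:
        env = solve_chain(steps)
    except (KeyError, ZeroDivisionError):
        # Undefined placeholder, unknown operator, or division by zero.
        return False
    return abs(env[steps[-1][3]] - gold_answer) <= tol

good = [(20, "+", 35, "y1"), ("y1", "*", 2, "y2")]
bad = [(20, "+", 35, "y1"), ("y1", "+", 2, "y2")]
print(keep_rewrite(good, 110), keep_rewrite(bad, 110))
```

A check of this kind only confirms that the final value agrees with the gold answer; it does not prove that the intermediate reasoning is sound.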

Key Findings

CoA improves QA accuracy on evaluated math benchmarks.

Numbers: GSM8K: ~+2.9 to ~+6.8 pp absolute (varies by model); average ~7.5% reported

Practical Use: If you fine-tune a LLaMa-family model with CoA, expect noticeable accuracy gains on multi-step math problems versus standard CoT and some tool baselines.

Evidence Ref: Abstract; Table 8; Table 4

CoA improves open-domain (Wikipedia) QA accuracy on evaluated benchmarks.

Numbers: HotpotQA Both: ~+5 to ~+11 pp absolute; paper reports ~4.5% average

Practical Use: Use CoA when answers require chaining multiple Wiki lookups; expect better exact-match rates than plain CoT or Toolformer variants on Hotpot-style tasks.

Evidence Ref: Abstract; Table 7; Table 9

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | CoA 38.29% vs CoT-FT 35.41% | CoT-FT | +2.88 pp | GSM8K | Table 8; Table 4 | Table 8
Accuracy | average ~7.5% absolute improvement (paper statement) | CoT / tool baselines | ~+7.5 pp | math benchmarks aggregate | Abstract; §5.1 | Abstract

What To Try In 7 Days

Rewrite a sample of your multi-step QA/agent outputs into abstract-step traces and validate with your tools (calculator, search).

Fine-tune a small LLM checkpoint on 1–3k CoA-style examples and compare accuracy and latency to your current pipeline.

Batch or pipeline tool calls so placeholder reification can run in parallel with decoding next samples.
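The pipelining idea in the last step can be prototyped with a thread pool. The sketch below is illustrative only: `decode_trace` and `reify` are hypothetical stand-ins for your model call and tool call, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_trace(question):
    """Hypothetical stand-in for the fine-tuned model's decoding step."""
    return f"plan for {question!r} with [lookup = y1]"

def reify(trace):
    """Hypothetical stand-in for the tool call(s) that fill placeholders."""
    return trace.replace("[lookup = y1]", "[lookup = 42]")

def run_pipelined(questions):
    """Hand each trace to the tool pool as soon as it is decoded.

    Tool work for question i then overlaps with decoding of question i+1.
    """
    with ThreadPoolExecutor(max_workers=4) as tools:
        pending = []
        for q in questions:
            trace = decode_trace(q)                     # model is busy here
            pending.append(tools.submit(reify, trace))  # tools run in background
        return [f.result() for f in pending]            # keeps input order

results = run_pipelined(["q1", "q2"])
print(results)
```

Threads suit this case when the tool calls are I/O-bound (network APIs, search); CPU-bound tools would likely need a process pool instead.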

Agent Features

Memory
No long-term retrieval memory described
Planning
Generate abstract multi-step reasoning traces with placeholders; holistic planning of interconnected tool calls
Tool Use
Call equation solver to reify arithmetic placeholders; call Wikipedia search (BM25 + SBERT re-rank) and NER to fill factual placeholders
Frameworks
Fine-tuning on CoA traces constructed with LLaMa-70B re-writing
Is Agentic

Yes

Architectures
LLaMa family (7B, 70B, LLaMa-2, LLaMa-2-Chat)
Collaboration
Decouples planner LLM from tool executors for pipeline parallelism

Optimization Features

Token Efficiency
Introduces short placeholder tokens but reduces overall tool-related overhead
Infra Optimization
Amortizes latency across examples and can batch tool execution
Model Optimization
Fine-tuning on small curated CoA datasets (2k–3k examples)
System Optimization
Pipeline design: decode CoA trace; invoke tools once; decode final answer from reified trace
Training Optimization
Balanced sampling across reasoning-step counts to avoid bias to multi-step problems
Inference Optimization
Parallelize decoding and tool calls across examples; batch reification of placeholders to avoid multiple sequential API waits
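The Training Optimization entry above mentions balanced sampling across reasoning-step counts. One way to approximate this, assuming each example carries a `num_steps` field and that a simple per-bucket cap is acceptable, is sketched below; the exact procedure used in the paper is not described here, so treat the function and field names as hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(examples, per_bucket, seed=0):
    """Sample up to `per_bucket` examples for each reasoning-step count."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["num_steps"]].append(ex)  # "num_steps" is an assumed field
    rng = random.Random(seed)
    sample = []
    for steps in sorted(buckets):
        group = buckets[steps]
        sample.extend(rng.sample(group, min(per_bucket, len(group))))
    return sample

# Toy data skewed toward one-step problems.
step_counts = [1] * 10 + [2] * 4 + [3] * 2
data = [{"id": i, "num_steps": s} for i, s in enumerate(step_counts)]
picked = balanced_sample(data, per_bucket=3)
print(sorted(ex["num_steps"] for ex in picked))
```

Capping each bucket keeps any single step count from dominating the fine-tuning set, at the cost of discarding some examples from the larger buckets.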

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluations limited to two domains (math and Wikipedia QA) and English only.

Method requires full-model fine-tuning in the paper; lower-cost variants (LoRA) are suggested but not evaluated.

When Not To Use

When you cannot fine-tune the base model due to resource limits and cannot apply lightweight adapters.

For domains where you cannot construct reliable tool validators to verify CoA rewrites.

Failure Modes

Incorrect placeholder alignment: wrong mapping between abstract variables and tool results may corrupt final answers.

Poor CoA rewrite quality in complex text domains can train the model on faulty plans.
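Some alignment problems can be caught with a cheap structural check before any tool is called. The sketch below assumes the same hypothetical `[expression = yN]` trace syntax and flags placeholders that are used before being defined or defined twice; it cannot detect a structurally valid trace that maps a value to the wrong variable.

```python
import re

DEF = re.compile(r"=\s*(y\d+)\s*\]")  # assumed definition syntax: "... = yN]"
USE = re.compile(r"\by\d+\b")
BRACKET = re.compile(r"\[[^\[\]]*\]")

def check_placeholders(trace):
    """Return a list of structural problems found in a CoA-style trace."""
    problems, defined = [], set()
    for step in BRACKET.findall(trace):
        target = DEF.search(step)
        if target is None:
            continue  # bracketed text that does not define a placeholder
        name = target.group(1)
        body = step[:target.start()]  # everything left of "= yN]"
        for used in USE.findall(body):
            if used not in defined:
                problems.append(f"{used} used before it is defined")
        if name in defined:
            problems.append(f"{name} defined more than once")
        defined.add(name)
    return problems

print(check_placeholders("[20 + 35 = y1] then [y1 * 2 = y2]"))
print(check_placeholders("[y2 + 1 = y1] [3 * 4 = y1]"))
```

Running such a check on both training rewrites and inference-time traces gives an early signal before a malformed plan reaches the tools.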

Core Entities

Models

LLaMa-7B, LLaMa-70B, LLaMa-2, LLaMa-2-Chat, Toolformer, FireAct, PAL, DECLARATIVE

Metrics

Accuracy; human-evaluated arithmetic error rate; human-evaluated reasoning error rate; wall-clock inference time (s per question)

Datasets

GSM8K, ASDiv, SVAMP, MAWPS, HotpotQA, WebQuestions, NaturalQuestions, TriviaQA

Benchmarks

Math reasoning (GSM8K, MAWPS, SVAMP); Wikipedia QA (HotpotQA, WQ, NQ, TriviaQA)