Train LLMs to plan with abstract placeholders, then fill them with tools to reason faster and more accurately

January 30, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

5

Authors

Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, Tianlu Wang

Links

Abstract / PDF

Why It Matters For Business

CoA makes multi-step tool use both more accurate and faster by separating plan generation from tool calls; this reduces arithmetic bugs and shortens latency when pipelines must call external APIs.

Summary TLDR

CoA (Chain-of-Abstraction) fine-tunes LLMs to first generate multi-step reasoning traces that use abstract placeholders, then calls domain tools once to fill (reify) those placeholders. This separates planning from fetching concrete facts or calculations. On the evaluated tasks it improves accuracy on math and Wikipedia QA, lowers arithmetic errors to zero in the human evaluation, and makes end-to-end inference ~1.3–1.5× faster.

Problem Statement

Tool-augmented LLMs often call APIs interleaved with generation. Interleaving (1) forces the model to plan and compute at the same time, hurting multi-step planning and robustness, and (2) incurs repeated waiting for tool responses, slowing inference. The paper aims to make multi-step tool use both more accurate and faster.

Main Contribution

Chain-of-Abstraction (CoA): fine-tune LLMs to output abstract multi-step reasoning traces with placeholders, then call tools to fill placeholders and produce answers.

A data construction pipeline that rewrites gold answers into CoA traces using LLaMa-70B and validates rewrites with domain tools (equation solver or Wiki search + NER).

Demonstration on two domains (math and Wikipedia QA) that CoA improves accuracy over chain-of-thought and tool-augmented baselines while reducing arithmetic errors and inference latency.
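
The plan-then-reify idea in the first contribution can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bracketed `[expression = yN]` step format, the `evaluate` and `reify` helpers, and the sample trace are all assumptions made for this example, and a tiny arithmetic evaluator stands in for the equation solver.

```python
import ast
import operator
import re

# Assumed trace format (hypothetical): each tool call is written "[expression = yN]".
STEP = re.compile(r"\[([^\[\]=]+)=\s*(y\d+)\s*\]")

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr, env):
    """Evaluate +, -, *, / over numbers and already-solved placeholders."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in env:
            return env[node.id]
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def reify(trace):
    """Solve every step in order, then substitute the results into the trace."""
    env = {}
    for expr, name in STEP.findall(trace):
        env[name] = evaluate(expr, env)
    filled = STEP.sub(lambda m: format(env[m.group(2)], "g"), trace)
    return filled, env


# The "plan": an abstract trace the fine-tuned model would emit (illustrative).
trace = ("Mark played for a total of [20 + 35 = y1] minutes. "
         "So he was on the sideline for [90 - y1 = y2] minutes.")
filled, values = reify(trace)
print(filled)
```

Because the model only writes the plan, the arithmetic is done entirely by the tool in a single pass, which is the property the summary credits for the drop in arithmetic errors.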

Key Findings

CoA improves QA accuracy on evaluated math benchmarks.

Numbers: GSM8K: +~2.9–6.8 pp absolute (varies by model); average ~7.5% reported

CoA improves open-domain (Wikipedia) QA accuracy on evaluated benchmarks.

Numbers: HotpotQA (Both): +~5–11 pp absolute; paper reports ~4.5% average

CoA cuts arithmetic errors to zero in the human-evaluated sample.

Numbers: Arithmetic error rate 0.0% vs 17.3% (CoT-FSP) and 25.2% (CoT-FT)

CoA reduces reasoning error rates in human judgment.

Numbers: Reasoning error rate 60.4% vs 70.3% (CoT-FSP) and 67.8% (CoT-FT), ≈7–10 pp improvement

CoA speeds up end-to-end inference on evaluated domains.

Numbers: Inference speedups reported: math ~1.47×, wiki ~1.33×; abstract mentions ~1.4× average

CoA fine-tuning data rewriting success varies by domain.

Numbers: Math: ~76.6% of CoA traces verified; Wiki QA: ~15.9% verified

CoA needs modest in-domain fine-tuning data and compute in these experiments.

Numbers: ~2K math QAs, ~3K wiki QAs; training runs ~2–5 hours on 8–64 A100 GPUs

Results

Accuracy

Value: CoA 38.29% vs CoT-FT 35.41%

Baseline: CoT-FT

Accuracy

Value: average ~7.5% absolute improvement (paper statement)

Baseline: CoT / tool baselines

HotpotQA exact match (LLaMa-2-Chat-7B, Both)

Value: CoA 28.22% vs CoT-FT 22.77%

Baseline: CoT-FT

Inference speed

Value: math ~1.47× faster; wiki ~1.33× faster

Baseline: Tool-augmented LLM baselines

Human arithmetic error rate

Value: CoA 0.0% vs CoT-FT 25.2%

Baseline: CoT-FT

CoA data rewrite verification rate

Value: Math ~76.6% of rewrites verified; Wiki ~15.9%

Baseline: n/a

Who Should Care

What To Try In 7 Days

Rewrite a sample of your multi-step QA/agent outputs into abstract-step traces and validate with your tools (calculator, search).

Fine-tune a small LLM checkpoint on 1–3k CoA-style examples and compare accuracy and latency to your current pipeline.

Batch or pipeline tool calls so placeholder reification can run in parallel with decoding next samples.
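
For the first step above, a validation pass might look like the sketch below: run each rewritten trace through a tool and keep it only if the computed final value matches the gold answer. The rewrite schema (`steps`, `final`, `gold`), the `solve` and `is_verified` helpers, and the three sample rewrites are hypothetical; a basic four-operation calculator stands in for whatever tool your pipeline uses.

```python
import math
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def solve(steps):
    """Run each (name, left, op, right) step; operands are numbers or earlier names."""
    env = {}
    for name, left, op, right in steps:
        a = env[left] if isinstance(left, str) else left
        b = env[right] if isinstance(right, str) else right
        env[name] = OPS[op](a, b)
    return env


def is_verified(rewrite):
    """Keep a rewrite only if the tool-computed final value matches the gold answer."""
    try:
        env = solve(rewrite["steps"])
    except (KeyError, ZeroDivisionError):
        return False  # undefined placeholder, unknown operator, or division by zero
    final = env.get(rewrite["final"])
    return final is not None and math.isclose(final, rewrite["gold"])


# Illustrative rewrites (made up for this sketch, not taken from the paper's data).
rewrites = [
    {"steps": [("y1", 20, "+", 35), ("y2", 90, "-", "y1")], "final": "y2", "gold": 35},
    {"steps": [("y1", 20, "+", 35), ("y2", 90, "+", "y1")], "final": "y2", "gold": 35},  # wrong plan
    {"steps": [("y1", 12, "*", "y3")], "final": "y1", "gold": 24},  # uses undefined y3
]
kept = [r for r in rewrites if is_verified(r)]
rate = len(kept) / len(rewrites)
print(f"verified {len(kept)}/{len(rewrites)} rewrites ({rate:.1%})")
```

Tracking the verified fraction per domain gives an early signal of how much usable fine-tuning data a rewrite pipeline will yield.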

Agent Features

Memory

  • No long-term retrieval memory described

Planning

  • Generate abstract multi-step reasoning traces with placeholders
  • Holistic planning of interconnected tool calls

Tool Use

  • Call equation solver to reify arithmetic placeholders
  • Call Wikipedia search (BM25 + SBERT re-rank) and NER to fill factual placeholders

Frameworks

  • Fine-tuning on CoA traces constructed with LLaMa-70B re-writing

Is Agentic

true

Architectures

  • LLaMa family (7B, 70B, LLaMa-2, LLaMa-2-Chat)

Collaboration

  • Decouples planner LLM from tool executors for pipeline parallelism

Optimization Features

Token Efficiency

  • Introduces short placeholder tokens but reduces overall tool-related overhead

Infra Optimization

  • Amortizes latency across examples and can batch tool execution

Model Optimization

  • Fine-tuning on small curated CoA datasets (2k–3k examples)

System Optimization

  • Pipeline design: decode CoA trace; invoke tools once; decode final answer from reified trace

Training Optimization

  • Balanced sampling across reasoning-step counts to avoid bias toward multi-step problems
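
One plausible reading of balanced sampling is to bucket training examples by their number of reasoning steps and draw the same number from each bucket. The sketch below assumes that reading; the `num_steps` field, the `balanced_sample` helper, and the bucket sizes are hypothetical and not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(examples, per_bucket, seed=0):
    """Draw up to `per_bucket` examples for each reasoning-step count."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["num_steps"]].append(ex)
    rng = random.Random(seed)  # seeded so the draw is reproducible
    sample = []
    for steps in sorted(buckets):
        pool = buckets[steps]
        sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
    rng.shuffle(sample)  # avoid ordering the training set by difficulty
    return sample


# Skewed toy pool: 10 one-step, 40 two-step, and 6 three-step examples.
examples = (
    [{"id": f"a{i}", "num_steps": 1} for i in range(10)]
    + [{"id": f"b{i}", "num_steps": 2} for i in range(40)]
    + [{"id": f"c{i}", "num_steps": 3} for i in range(6)]
)
sample = balanced_sample(examples, per_bucket=8)
counts = Counter(ex["num_steps"] for ex in sample)
print(dict(counts))
```

Buckets smaller than `per_bucket` are used in full, so the cap limits over-represented step counts without discarding scarce ones.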

Inference Optimization

  • Parallelize decoding and tool calls across examples
  • Batch reification of placeholders to avoid multiple sequential API waits
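
The overlap between decoding and tool calls can be sketched with a thread pool: as soon as one trace is decoded, its tool call is handed to a worker while the main thread decodes the next question. `decode_trace` and `call_tool` are hypothetical stand-ins that sleep briefly to simulate an LLM and a slow external tool; this is a toy illustration, not the paper's serving setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def decode_trace(question):
    """Stand-in for LLM decoding (hypothetical): returns a one-step abstract plan."""
    time.sleep(0.01)
    a, b = question
    return {"expr": (a, b), "template": "The total is {y1}."}


def call_tool(trace):
    """Stand-in for a slow external tool (hypothetical): fills the placeholder."""
    time.sleep(0.01)
    a, b = trace["expr"]
    return trace["template"].format(y1=a + b)


def run_pipelined(questions):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit each tool call as soon as its trace is decoded; the main thread
        # moves on to decoding the next question instead of waiting for the tool.
        futures = [pool.submit(call_tool, decode_trace(q)) for q in questions]
        return [f.result() for f in futures]  # results stay in input order


answers = run_pipelined([(1, 2), (10, 5), (7, 7)])
print(answers)
```

With interleaved tool use the model would block on every call; here the tool wait for one example is hidden behind decoding of the next.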

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations limited to two domains (math and Wikipedia QA) and English only.
  • Method requires full-model fine-tuning in the paper; lower-cost variants (LoRA) are suggested but not evaluated.
  • CoA data rewriting is harder for free-text retrieval tasks (low automatic success rate ~15.9% for Wiki).

When Not To Use

  • When you cannot fine-tune the base model due to resource limits and cannot apply lightweight adapters.
  • For domains where you cannot construct reliable tool validators to verify CoA rewrites.
  • When tool latency is negligible and interleaved calling is simpler and sufficient.

Failure Modes

  • Incorrect placeholder alignment: wrong mapping between abstract variables and tool results may corrupt final answers.
  • Poor CoA rewrite quality in complex text domains can train the model on faulty plans.
  • If tools return noisy or ambiguous outputs, the reified chain can still yield incorrect conclusions despite correct planning.
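
Part of the placeholder-alignment failure mode can be caught with a cheap structural check before any tool is called. The sketch below assumes a hypothetical `[expression = yN]` step format and flags placeholders that are used before they are defined or defined twice; it does not detect semantically wrong plans or noisy tool outputs.

```python
import re

# Assumed trace format (hypothetical): each tool call is written "[expression = yN]".
STEP = re.compile(r"\[([^\[\]=]+)=\s*(y\d+)\s*\]")
NAME = re.compile(r"\by\d+\b")


def alignment_errors(trace):
    """Return structural problems with placeholder definitions, in trace order."""
    errors = []
    defined = set()
    for expr, name in STEP.findall(trace):
        for used in NAME.findall(expr):
            if used not in defined:
                errors.append(f"{used} used before it is defined")
        if name in defined:
            errors.append(f"{name} defined more than once")
        defined.add(name)
    return errors


good = "He played [20 + 35 = y1] minutes and rested [90 - y1 = y2] minutes."
bad = "He rested [90 - y2 = y1] minutes after playing [20 + 35 = y1] minutes."
print(alignment_errors(good))
print(alignment_errors(bad))
```

Traces that fail this check can be rejected or regenerated before they reach the tools or the fine-tuning set.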

Core Entities

Models

  • LLaMa-7B
  • LLaMa-70B
  • LLaMa-2
  • LLaMa-2-Chat
  • Toolformer
  • FireAct
  • PAL
  • DECLARATIVE

Metrics

  • Accuracy
  • Human-evaluated arithmetic error rate
  • Human-evaluated reasoning error rate
  • Wall-clock inference time (s per question)

Datasets

  • GSM8K
  • ASDiv
  • SVAMP
  • MAWPS
  • HotpotQA
  • WebQuestions
  • NaturalQuestions
  • TriviaQA

Benchmarks

  • math reasoning (GSM8K, MAWPS, SVAMP)
  • Wikipedia QA (HotpotQA, WQ, NQ, TriviaQA)