Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
CoA makes multi-step tool use both more accurate and faster by separating plan generation from tool calls. This reduces arithmetic errors and cuts latency in pipelines that must call external APIs.
Summary TLDR
CoA (Chain-of-Abstraction) fine-tunes LLMs to first generate multi-step reasoning traces that use abstract placeholders, then calls domain tools once to fill (reify) those placeholders. This separates planning from fetching concrete facts or performing calculations, improving accuracy on math and Wikipedia QA, cutting arithmetic errors to zero in the human-evaluated sample, and making end-to-end inference ~1.3–1.5× faster on the evaluated tasks.
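To make the placeholder idea concrete, here is a minimal sketch of the reification step for arithmetic traces. The step format `[<expression> = yN]`, the function names, and the small arithmetic evaluator standing in for an equation solver are all assumptions for illustration, not the paper's implementation.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

# Assumed step format (hypothetical): "[<expression> = yN]".
STEP = re.compile(r"\[([^\[\]=]+)=\s*(y\d+)\s*\]")
NAME = re.compile(r"\by\d+\b")

def _eval(node, env):
    """Evaluate a small arithmetic AST; stands in for a real equation solver."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, env)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]  # value of an earlier placeholder
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    raise ValueError("unsupported expression")

def _fmt(value):
    """Print whole numbers without a trailing '.0'."""
    return str(int(value)) if float(value).is_integer() else str(value)

def reify(trace):
    """Fill every placeholder in an already-decoded abstract trace."""
    env = {}

    def solve_step(match):
        expr, name = match.group(1).strip(), match.group(2)
        env[name] = _eval(ast.parse(expr, mode="eval"), env)
        return f"[{expr} = {_fmt(env[name])}]"

    # Steps are solved left to right, so later steps can use earlier results.
    text = STEP.sub(solve_step, trace)
    # Substitute remaining placeholder mentions with their computed values.
    text = NAME.sub(lambda m: _fmt(env[m.group(0)]) if m.group(0) in env else m.group(0), text)
    return text, env
```

For example, `reify("Total: [20 + 35 = y1]; then [y1 + 90 = y2]. Answer: y2.")` fills both steps after decoding has finished, so the model never pauses mid-generation to wait for a result.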
Problem Statement
Tool-augmented LLMs often call APIs interleaved with generation. Interleaving (1) forces the model to plan and compute at the same time, hurting multi-step planning and robustness, and (2) incurs repeated waiting for tool responses, slowing inference. The paper aims to make multi-step tool use both more accurate and faster.
Main Contribution
Chain-of-Abstraction (CoA): fine-tune LLMs to output abstract multi-step reasoning traces with placeholders, then call tools to fill placeholders and produce answers.
A data construction pipeline that rewrites gold answers into CoA traces using LLaMa-70B and validates rewrites with domain tools (equation solver or Wiki search + NER).
Demonstration on two domains (math and Wikipedia QA) that CoA improves accuracy over chain-of-thought and tool-augmented baselines while reducing arithmetic errors and inference latency.
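The validation step of the data construction pipeline can be sketched as a simple filter: a rewritten trace is kept only if executing it with the domain tool reproduces the gold answer. The field names `coa_trace` and `gold`, and the `toy_tool` stand-in, are hypothetical; a real setup would plug in its own solver or search tool.

```python
def filter_verified(examples, reify_answer, tol=1e-6):
    """Keep rewrites whose tool-computed answer matches the gold answer.

    Returns the kept examples and the verification rate.
    """
    kept = []
    total = 0
    for ex in examples:
        total += 1
        try:
            predicted = reify_answer(ex["coa_trace"])
        except Exception:
            continue  # the tool could not execute the trace: discard the rewrite
        if abs(predicted - ex["gold"]) <= tol:
            kept.append(ex)
    rate = len(kept) / total if total else 0.0
    return kept, rate

def toy_tool(trace):
    """Hypothetical stand-in for an equation solver: the 'trace' is just 'a+b'."""
    left, right = trace.split("+")
    return float(left) + float(right)
```

The returned rate corresponds to the kind of rewrite verification rate reported per domain; a low rate signals that the rewriting prompt or the validator needs work before fine-tuning.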
Key Findings
CoA improves QA accuracy on evaluated math benchmarks.
CoA improves open-domain (Wikipedia) QA accuracy on evaluated benchmarks.
CoA cuts arithmetic errors to zero in the human-evaluated sample.
CoA reduces reasoning error rates in human judgment.
CoA speeds up end-to-end inference on evaluated domains.
The success rate of rewriting fine-tuning data into CoA traces varies by domain.
CoA needs modest in-domain fine-tuning data and compute in these experiments.
Results
Accuracy
Accuracy
HotpotQA exact match (LLaMa-2-Chat-7B, Both)
Inference speed
Human arithmetic error rate
CoA data rewrite verification rate
Who Should Care
What To Try In 7 Days
Rewrite a sample of your multi-step QA/agent outputs into abstract-step traces and validate with your tools (calculator, search).
Fine-tune a small LLM checkpoint on 1–3k CoA-style examples and compare accuracy and latency to your current pipeline.
Batch or pipeline tool calls so placeholder reification can run in parallel with decoding next samples.
Agent Features
Memory
- No long-term retrieval memory described
Planning
- Generate abstract multi-step reasoning traces with placeholders
- Holistic planning of interconnected tool calls
Tool Use
- Call equation solver to reify arithmetic placeholders
- Call Wikipedia search (BM25 + SBERT re-rank) and NER to fill factual placeholders
Frameworks
- Fine-tuning on CoA traces constructed by rewriting gold answers with LLaMa-70B
Is Agentic
true
Architectures
- LLaMa family (7B, 70B, LLaMa-2, LLaMa-2-Chat)
Collaboration
- Decouples planner LLM from tool executors for pipeline parallelism
Optimization Features
Token Efficiency
- Introduces short placeholder tokens but reduces overall tool-related overhead
Infra Optimization
- Amortizes latency across examples and can batch tool execution
Model Optimization
- Fine-tuning on small curated CoA datasets (2k–3k examples)
System Optimization
- Pipeline design: decode CoA trace; invoke tools once; decode final answer from reified trace
Training Optimization
- Balanced sampling across reasoning-step counts to avoid bias toward multi-step problems
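One way to balance a fine-tuning set across reasoning-step counts is round-robin sampling over buckets. The summary does not specify the exact scheme, so this is an assumed approach, and the `num_steps` field is a hypothetical name.

```python
import random
from collections import defaultdict

def balanced_sample(examples, n_total, seed=0):
    """Draw up to n_total examples, spread evenly over step-count buckets."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["num_steps"]].append(ex)
    for items in buckets.values():
        rng.shuffle(items)

    sample = []
    # Round-robin: take one example per bucket per pass, so no step count
    # dominates until the smaller buckets run out.
    while len(sample) < n_total and any(buckets.values()):
        for key in sorted(buckets):
            if buckets[key] and len(sample) < n_total:
                sample.append(buckets[key].pop())
    return sample
```

When a bucket is exhausted, the remaining quota is filled from the larger buckets, so the result is only as balanced as the data allows.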
Inference Optimization
- Parallelize decoding and tool calls across examples
- Batch reification of placeholders to avoid multiple sequential API waits
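The overlap between decoding and tool calls can be sketched with a thread pool: each trace is handed to a background worker for reification while the next question is decoded. `decode` and `call_tools` are hypothetical callables standing in for the fine-tuned model and the tool backend.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(questions, decode, call_tools, max_workers=4):
    """Decode traces sequentially while tool calls run in the background."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for question in questions:
            trace = decode(question)                        # model decoding
            futures.append(pool.submit(call_tools, trace))  # non-blocking tool call
        # Collect results in the original question order.
        return [future.result() for future in futures]
```

Threads suit this case when tool calls are I/O-bound (HTTP requests, search backends); CPU-heavy tools would more likely need a process pool or a separate service.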
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations limited to two domains (math and Wikipedia QA) and English only.
- Method requires full-model fine-tuning in the paper; lower-cost variants (LoRA) are suggested but not evaluated.
- CoA data rewriting is harder for free-text retrieval tasks (automatic rewrite success rate of only ~15.9% for Wiki).
When Not To Use
- When you cannot fine-tune the base model due to resource limits and cannot apply lightweight adapters.
- For domains where you cannot construct reliable tool validators to verify CoA rewrites.
- When tool latency is negligible and interleaved calling is simpler and sufficient.
Failure Modes
- Incorrect placeholder alignment: wrong mapping between abstract variables and tool results may corrupt final answers.
- Poor CoA rewrite quality in complex text domains can train the model on faulty plans.
- If tools return noisy or ambiguous outputs, the reified chain can still yield incorrect conclusions despite correct planning.
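Some placeholder-alignment problems can be caught structurally before any tool is called. The sketch below assumes steps are written as `[<expression> = yN]` (a hypothetical format) and flags placeholders that are used before being defined, defined twice, or missing. It cannot detect a well-formed trace whose variables are semantically mismatched.

```python
import re

BRACKET = re.compile(r"\[[^\[\]]*\]")     # one bracketed step
DEFINE = re.compile(r"=\s*(y\d+)\s*\]")   # the placeholder a step defines
USE = re.compile(r"\by\d+\b")             # any placeholder mention

def check_placeholders(trace):
    """Return a list of structural problems found in an abstract trace."""
    problems = []
    defined = set()
    for step in BRACKET.findall(trace):
        match = DEFINE.search(step)
        lhs = step[: match.start()] if match else step
        # Every placeholder on the left-hand side must already be defined.
        for name in USE.findall(lhs):
            if name not in defined:
                problems.append(f"{name} used before definition in {step}")
        if match is None:
            problems.append(f"no output placeholder in {step}")
        elif match.group(1) in defined:
            problems.append(f"{match.group(1)} defined twice")
        else:
            defined.add(match.group(1))
    return problems
```

An empty list means the trace is structurally consistent; any reported problem is a reason to reject or regenerate the trace before reification.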
Core Entities
Models
- LLaMa-7B
- LLaMa-70B
- LLaMa-2
- LLaMa-2-Chat
- Toolformer
- FireAct
- PAL
- DECLARATIVE
Metrics
- Accuracy
- Human-evaluated arithmetic error rate
- Human-evaluated reasoning error rate
- Wall-clock inference time (s per question)
Datasets
- GSM8K
- ASDiv
- SVAMP
- MAWPS
- HotpotQA
- WebQuestions
- NaturalQuestions
- TriviaQA
Benchmarks
- math reasoning (GSM8K, MAWPS, SVAMP)
- Wikipedia QA (HotpotQA, WQ, NQ, TriviaQA)

