ACE: an agentic Retrieve‑or‑Think loop that keeps context concise and boosts multi-hop QA accuracy

January 13, 2026, 6 min read

Overview

Decision Snapshot: Needs Validation

ACE shows consistent accuracy gains on multi-hop QA and uses fewer tokens than brute-force iterative retrieval, but it requires orchestration and tuning of the step limit, and it can use more tokens than single-step RAG.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li

Why It Matters For Business

ACE gives higher accuracy on complex question answering while avoiding many costly retrieval calls; this can reduce cloud costs and improve product accuracy for knowledge-intensive features.

Summary TLDR

The paper introduces ACE, a multi-agent framework that decides at each step whether to retrieve external documents or to "think" (reason over the current context). A central orchestrator uses majority voting to choose between a retriever agent and a reasoner agent. On three multi-hop QA benchmarks (MultiHop-RAG, HotpotQA, 2WikiQA) with LLaMA-3-18B-Instruct, ACE raises accuracy (e.g., HotpotQA 62.8% vs RAG 38.9%) and cuts token cost versus a brute-force iterative baseline (MultiHop-RAG tokens 10,653 vs 18,196 for IterDRAG). The maximum number of steps (N) needs tuning, because too many iterations can lower accuracy.

Problem Statement

Current retrieval-augmented systems retrieve at every step and often bloat context with irrelevant material. This wastes tokens, slows inference, and harms multi-hop reasoning. We need a dynamic controller that selectively retrieves only when needed and otherwise refines internal reasoning.

Main Contribution

Propose context evolution: alternate deliberate retrieve-or-think steps instead of blind retrieval at every step.

Design ACE: a multi-agent loop with a central orchestrator that majority-votes to invoke a retriever or a reasoner.
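The retrieve-or-think loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `WorkingMemory`, `majority_vote`, `run_loop`, the `is_done` stopping check, and the tie-breaking rule are all assumptions made for the example, and the voter, retriever, and reasoner callables stand in for LLM-backed agents.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List

RETRIEVE, THINK = "RETRIEVE", "THINK"


@dataclass
class WorkingMemory:
    """Accumulated retrieved passages and intermediate thoughts (hypothetical layout)."""
    question: str
    entries: List[str] = field(default_factory=list)


def majority_vote(votes: List[str]) -> str:
    """Pick RETRIEVE only on a strict majority; ties fall back to THINK (an assumption)."""
    counts = Counter(votes)
    return RETRIEVE if counts[RETRIEVE] > counts[THINK] else THINK


def run_loop(
    question: str,
    voters: List[Callable[[WorkingMemory], str]],
    retrieve: Callable[[WorkingMemory], List[str]],
    think: Callable[[WorkingMemory], str],
    is_done: Callable[[WorkingMemory], bool],
    max_steps: int,
) -> WorkingMemory:
    """Alternate retrieve and think steps for at most max_steps (the N in the text)."""
    memory = WorkingMemory(question)
    for _ in range(max_steps):
        action = majority_vote([vote(memory) for vote in voters])
        if action == RETRIEVE:
            memory.entries.extend(retrieve(memory))  # add external documents
        else:
            memory.entries.append(think(memory))     # add an internal reasoning step
        if is_done(memory):                          # hypothetical stopping check
            break
    return memory


if __name__ == "__main__":
    # Toy stand-ins: vote to retrieve while memory is empty, then think.
    toy_voters = [lambda m: RETRIEVE if not m.entries else THINK] * 3
    result = run_loop(
        "Who directed the film that won Best Picture in 1998?",
        toy_voters,
        retrieve=lambda m: ["<retrieved passage>"],
        think=lambda m: "<intermediate thought>",
        is_done=lambda m: len(m.entries) >= 3,
        max_steps=5,
    )
    print(result.entries)
```

In a real system each voter would be a prompt to the LLM asking whether the current memory is sufficient, so the cost of the vote itself should be counted alongside retrieval and reasoning tokens.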

Key Findings

Large accuracy gains on HotpotQA compared to single-step RAG.

Numbers: HotpotQA accuracy, ACE 62.8% vs RAG 38.9% (+23.9 pp)

Practical Use: If you handle multi-hop questions, replace single-step RAG with ACE-style selective retrieval to substantially improve answer correctness on similar benchmarks.

Evidence Ref: Table 1

Substantial accuracy improvement on 2WikiQA over RAG.

Numbers: 2WikiQA accuracy, ACE 47.9% vs RAG 28.8% (+19.1 pp)

Practical Use: ACE's retrieve-or-think loop helps recover multi-hop links missing from single-shot retrieval; try it for tasks needing evidence chaining.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 57.9% | RAG 49.2% | +8.7 pp | MultiHop-RAG | Table 1 reports ACE 57.9% vs RAG 49.2% | Table 1
Avg. Tokens | 10,653 | IterDRAG 18,196 | -41.4% | MultiHop-RAG | Table 1 tokens: ACE 10,653, IterDRAG 18,196 | Table 1
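The two deltas above use different conventions: the accuracy delta is a difference in percentage points, while the token delta is a relative change against the baseline. A small sketch of both calculations, using only the MultiHop-RAG figures quoted above (the helper names `pp_delta` and `relative_change` are made up for this example):

```python
def pp_delta(new_pct: float, base_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return new_pct - base_pct


def relative_change(new: float, base: float) -> float:
    """Relative change of new against base, as a percentage of base."""
    return (new - base) / base * 100.0


# MultiHop-RAG figures as quoted in the results above.
acc_delta = pp_delta(57.9, 49.2)                # ACE vs RAG accuracy
token_change = relative_change(10_653, 18_196)  # ACE vs IterDRAG average tokens

print(f"accuracy delta: {acc_delta:+.1f} pp")
print(f"token change:   {token_change:+.2f}%")
```

The token change works out to a reduction of roughly 41%, in line with the reported delta.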

What To Try In 7 Days

Run ACE-style controller with your existing retriever and LLM on a small multi-hop subset.

Add a simple majority-vote orchestrator that picks RETRIEVE or THINK per step.

Sweep the max-step N to find the sweet spot for accuracy vs cost on your data.
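One way to run the max-step sweep from the list above is sketched below. It assumes you already have a function that runs your pipeline with a given N on a held-out subset and returns accuracy and average tokens; `evaluate`, `sweep_max_steps`, `best_max_steps`, and the candidate values are hypothetical names and choices, and `toy_evaluate` is a synthetic placeholder with invented numbers, not a result from the paper.

```python
from typing import Callable, Dict, Iterable, Tuple

# (accuracy, average tokens per question) for one setting of N
Score = Tuple[float, float]


def sweep_max_steps(
    evaluate: Callable[[int], Score], candidates: Iterable[int]
) -> Dict[int, Score]:
    """Run the evaluation once per candidate N and collect the scores."""
    return {n: evaluate(n) for n in candidates}


def best_max_steps(results: Dict[int, Score]) -> int:
    """Pick the N with the highest accuracy; break ties by lower token use."""
    return max(results, key=lambda n: (results[n][0], -results[n][1]))


def toy_evaluate(n: int) -> Score:
    """Synthetic placeholder: accuracy peaks at N=4, tokens grow linearly with N."""
    accuracy = 0.50 - 0.01 * (n - 4) ** 2
    avg_tokens = 1000.0 * n
    return accuracy, avg_tokens


if __name__ == "__main__":
    results = sweep_max_steps(toy_evaluate, [1, 2, 4, 6, 8])
    for n, (acc, tokens) in sorted(results.items()):
        print(f"N={n}: accuracy={acc:.2f}, avg_tokens={tokens:.0f}")
    print("selected N:", best_max_steps(results))
```

If cost matters as much as accuracy, the selection rule could instead take the smallest N whose accuracy is within a tolerance of the best one.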

Agent Features

Memory: working memory M_i (accumulated contexts and thoughts)
Planning: interleaved retrieve-or-think loop; majority-vote decision
Tool Use: retriever agent (external docs); reasoner agent (internal sub-queries)
Frameworks: ACE
Is Agentic: Yes
Architectures: multi-agent orchestrator
Collaboration: committee voting among agents

Optimization Features

Token Efficiency: avoids brute-force iterative retrieval; fewer tokens vs IterDRAG on tested sets
Inference Optimization: reduces redundant retrieval calls to save tokens

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

MultiHop-RAG, HotpotQA, 2WikiQA

Risks & Boundaries

Limitations

ACE uses more tokens than single-step RAG; higher latency/cost in some cases.

Requires tuning max iterations (N) per dataset to avoid performance drops.

When Not To Use

When minimal latency or token cost is the top priority over accuracy.

For simple single-hop lookups where single-step retrieval suffices.

Failure Modes

Excessive iterations can introduce distracting info and lower accuracy.

Orchestrator majority vote can be wrong and lead to unnecessary retrievals or missed evidence.
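To get a feel for how often a majority vote can go wrong, the sketch below computes the error rate of a strict majority under a simplifying assumption that the source does not state: each voter is independently correct with the same probability. Voters backed by the same LLM are likely correlated, so real error rates may be higher than this estimate. The function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(p_correct: float, n_voters: int) -> float:
    """Probability that a strict majority of n independent voters is wrong.

    Assumes an odd number of voters, each correct with probability p_correct.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    p_wrong = 1.0 - p_correct
    needed = n_voters // 2 + 1  # wrong votes needed for a wrong majority
    return sum(
        comb(n_voters, k) * p_wrong**k * p_correct ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} voters at 70% individual accuracy: "
              f"majority wrong with probability {majority_error(0.7, n):.3f}")
```

Under these assumptions, adding voters lowers the error rate only when each voter is right more often than not, and each extra voter adds an LLM call per step.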

Core Entities

Models

LLaMA-3-18B-Instruct

Metrics

Accuracy, Average Token Consumption

Datasets

MultiHop-RAG, HotpotQA, 2WikiQA

Benchmarks

MultiHop-RAG, HotpotQA, 2WikiQA