ACE: an agentic Retrieve‑or‑Think loop that keeps context concise and boosts multi-hop QA accuracy

January 13, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li

Links

Abstract / PDF

Why It Matters For Business

ACE improves accuracy on complex question answering while skipping many of the retrieval calls that an always-retrieve iterative pipeline would make. Relative to such pipelines, this can lower cloud costs while improving the accuracy of knowledge-intensive features.

Summary TLDR

The paper introduces ACE, a multi-agent framework that decides at each step whether to retrieve external documents or to "think" (reason over the current context). A central orchestrator uses majority voting to choose between a retriever agent and a reasoner agent. On three multi-hop QA benchmarks (MultiHop-RAG, HotpotQA, 2WikiQA) with LLaMA-3-18B-Instruct, ACE raises accuracy over single-step RAG (e.g., HotpotQA 62.8% vs 38.9%) and uses fewer tokens than a brute-force iterative baseline (MultiHop-RAG: 10,653 tokens vs 18,196 for IterDRAG). The maximum number of steps (N) must be tuned per dataset, because too many iterations can lower accuracy.

Problem Statement

Current retrieval-augmented systems retrieve at every step and often bloat context with irrelevant material. This wastes tokens, slows inference, and harms multi-hop reasoning. We need a dynamic controller that selectively retrieves only when needed and otherwise refines internal reasoning.

Main Contribution

Propose context evolution: alternate deliberate retrieve-or-think steps instead of blind retrieval at every step.

Design ACE: a multi-agent loop with a central orchestrator that majority-votes to invoke a retriever or a reasoner.

Show empirical gains on three multi-hop QA sets: higher accuracy and lower token use vs naive iterative baselines.
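
The retrieve-or-think loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The summary only states that a central orchestrator majority-votes between a retriever agent and a reasoner agent. The `Voter` and `Agent` interfaces, the `RETRIEVE`/`THINK` labels, the tie-breaking rule, the fixed `max_steps` stopping rule, and all toy functions are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple

RETRIEVE, THINK = "RETRIEVE", "THINK"

# Hypothetical interfaces: every callable sees the question and the working memory.
Voter = Callable[[str, List[str]], str]   # returns RETRIEVE or THINK
Agent = Callable[[str, List[str]], str]   # returns new text to append to memory


def majority_vote(votes: List[str]) -> str:
    """RETRIEVE wins only with more votes than THINK; ties go to THINK (an assumption)."""
    counts = Counter(votes)
    return RETRIEVE if counts[RETRIEVE] > counts[THINK] else THINK


def run_loop(question: str, voters: List[Voter], retriever: Agent,
             reasoner: Agent, max_steps: int) -> Tuple[List[str], List[str]]:
    """Run max_steps retrieve-or-think steps; return (memory, actions taken)."""
    memory: List[str] = []
    actions: List[str] = []
    for _ in range(max_steps):
        action = majority_vote([vote(question, memory) for vote in voters])
        agent = retriever if action == RETRIEVE else reasoner
        memory.append(agent(question, memory))
        actions.append(action)
    return memory, actions


# Toy stand-ins for LLM-backed components, for illustration only.
def vote_retrieve_if_empty(question: str, memory: List[str]) -> str:
    return RETRIEVE if not memory else THINK


def vote_think(question: str, memory: List[str]) -> str:
    return THINK


def toy_retriever(question: str, memory: List[str]) -> str:
    return f"doc for: {question}"


def toy_reasoner(question: str, memory: List[str]) -> str:
    return f"thought after {len(memory)} memory item(s)"


memory, actions = run_loop(
    "toy multi-hop question",
    voters=[vote_retrieve_if_empty, vote_retrieve_if_empty, vote_think],
    retriever=toy_retriever, reasoner=toy_reasoner, max_steps=3,
)
print(actions)
```

With these toy voters, the first step retrieves (two of three votes) and the remaining two steps think, so `actions` is `['RETRIEVE', 'THINK', 'THINK']`. In a real system each voter and agent would wrap an LLM or search call.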

Key Findings

Large accuracy gains on HotpotQA compared to single-step RAG.

Numbers: HotpotQA accuracy, ACE 62.8% vs RAG 38.9% (+23.9 pp)

Substantial accuracy improvement on 2WikiQA over RAG.

Numbers: 2WikiQA accuracy, ACE 47.9% vs RAG 28.8% (+19.1 pp)

Moderate accuracy uplift on MultiHop-RAG versus RAG.

Numbers: MultiHop-RAG accuracy, ACE 57.9% vs RAG 49.2% (+8.7 pp)

Token efficiency vs brute-force iterative baseline.

Numbers: MultiHop-RAG average tokens, ACE 10,653 vs IterDRAG 18,196 (≈41% fewer tokens)
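
The deltas quoted in these findings can be rechecked with a few lines of arithmetic. The helper names below are illustrative. The inputs are the figures reported above: accuracy gains are in percentage points (pp), and the token change is relative to the IterDRAG baseline.

```python
def pp_gain(ours: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_change(ours: float, baseline: float) -> float:
    """Signed fractional change relative to the baseline (negative = fewer)."""
    return (ours - baseline) / baseline


accuracy = {  # dataset: (ACE accuracy %, RAG accuracy %), as reported above
    "HotpotQA": (62.8, 38.9),
    "2WikiQA": (47.9, 28.8),
    "MultiHop-RAG": (57.9, 49.2),
}
for name, (ace, rag) in accuracy.items():
    print(f"{name}: +{pp_gain(ace, rag)} pp")

# MultiHop-RAG average tokens: ACE vs IterDRAG
token_change = relative_change(10_653, 18_196)
print(f"MultiHop-RAG tokens: {token_change:.1%}")
```

The token change comes out at roughly -41%, matching the figure quoted above.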

Results

MultiHop-RAG

  • Accuracy: 57.9% (baseline: RAG 49.2%)
  • Avg. Tokens: 10,653 (baseline: IterDRAG 18,196)

HotpotQA

  • Accuracy: 62.8% (baseline: RAG 38.9%)
  • Avg. Tokens: 3,271 (baseline: IterDRAG 723)

2WikiQA

  • Accuracy: 47.9% (baseline: RAG 28.8%)
  • Avg. Tokens: 2,945 (baseline: IterDRAG 9,760)

Who Should Care

What To Try In 7 Days

Run ACE-style controller with your existing retriever and LLM on a small multi-hop subset.

Add a simple majority-vote orchestrator that picks RETRIEVE or THINK per step.

Sweep the max-step N to find the sweet spot for accuracy vs cost on your data.
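
One way to run the max-step sweep is sketched below. It is a hypothetical harness, not something taken from the paper. `evaluate` stands for your own evaluation routine and is assumed to return `(accuracy_pct, avg_tokens)` for a given N. The selection rule (cheapest N within `tolerance` percentage points of the best accuracy) is one reasonable choice among many. The numbers in `fake_results` are made-up placeholders, not results from the paper.

```python
from typing import Callable, Dict, Iterable, Tuple

Result = Tuple[float, float]  # (accuracy in %, average tokens per question)


def sweep_max_steps(evaluate: Callable[[int], Result],
                    candidates: Iterable[int],
                    tolerance: float = 0.5) -> Tuple[int, Dict[int, Result]]:
    """Evaluate each N, then pick the cheapest N within `tolerance` pp of the best accuracy."""
    results = {n: evaluate(n) for n in candidates}
    best_acc = max(acc for acc, _ in results.values())
    eligible = [n for n, (acc, _) in results.items() if best_acc - acc <= tolerance]
    chosen = min(eligible, key=lambda n: results[n][1])  # lowest token cost wins
    return chosen, results


# Placeholder numbers for illustration only: accuracy rises, then drops as N grows.
fake_results = {
    1: (50.0, 1000.0),
    2: (55.0, 2000.0),
    3: (57.0, 3000.0),
    4: (56.8, 4000.0),
    5: (54.0, 5000.0),
}

chosen_n, table = sweep_max_steps(lambda n: fake_results[n], candidates=range(1, 6))
print(chosen_n)
```

On the placeholder data both N=3 and N=4 fall within the tolerance, and the sweep picks N=3 because it is cheaper. Replacing the lambda with a function that runs your controller on a held-out subset gives the real trade-off curve.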

Agent Features

Memory

  • working memory M_i (accumulated contexts and thoughts)
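
A working memory of this kind could be represented as an append-only log of retrieved contexts and generated thoughts. The sketch below is an assumption about the data structure, not the paper's definition of M_i. The class names, the `kind` labels, and the `render` format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Literal

Kind = Literal["context", "thought"]


@dataclass
class MemoryEntry:
    step: int   # 1-based step at which the entry was added
    kind: Kind  # "context" for retrieved text, "thought" for reasoning output
    text: str


@dataclass
class WorkingMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, kind: Kind, text: str) -> None:
        """Append one entry; the step number is its position in the log."""
        self.entries.append(MemoryEntry(len(self.entries) + 1, kind, text))

    def render(self) -> str:
        """Serialize the log as one entry per line, e.g. for inclusion in a prompt."""
        return "\n".join(f"[{e.step}:{e.kind}] {e.text}" for e in self.entries)


m = WorkingMemory()
m.add("context", "doc A")
m.add("thought", "need evidence about B")
print(m.render())
```

Keeping the entry kind explicit makes it easy to count how many steps were retrievals versus reasoning when comparing token use across runs.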

Planning

  • interleaved retrieve-or-think loop
  • majority-vote decision

Tool Use

  • retriever agent (external docs)
  • reasoner agent (internal sub-queries)

Frameworks

  • ACE

Is Agentic

true

Architectures

  • multi-agent orchestrator

Collaboration

  • committee voting among agents

Optimization Features

Token Efficiency

  • avoids brute-force iterative retrieval; fewer tokens vs IterDRAG on tested sets

Inference Optimization

  • reduces redundant retrieval calls to save tokens

Reproducibility

Data Urls

  • MultiHop-RAG
  • HotpotQA
  • 2WikiQA

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • ACE uses more tokens than single-step RAG; higher latency/cost in some cases.
  • Requires tuning max iterations (N) per dataset to avoid performance drops.
  • Evaluation limited to three multi-hop QA datasets and one LLM backbone.

When Not To Use

  • When minimal latency or token cost is the top priority over accuracy.
  • For simple single-hop lookups where single-step retrieval suffices.
  • If you lack an indexed external corpus to retrieve from.

Failure Modes

  • Excessive iterations can introduce distracting info and lower accuracy.
  • Orchestrator majority vote can be wrong and lead to unnecessary retrievals or missed evidence.
  • Wrong retrieved documents still pollute working memory and mislead reasoning.

Core Entities

Models

  • LLaMA-3-18B-Instruct

Metrics

  • Accuracy
  • Average Token Consumption

Datasets

  • MultiHop-RAG
  • HotpotQA
  • 2WikiQA

Benchmarks

  • MultiHop-RAG
  • HotpotQA
  • 2WikiQA