Train a single LLM to ask itself clarifying questions and point to exact paragraphs to solve multi‑step questions in 128K contexts

Overview

Decision Snapshot: Ready For Pilot

The idea is practical: collect expensive traces once with a tree search, then train a model to do the retrieval+clarification in one pass; evidence shows strong, reproducible gains on 7 benchmarks up to 128K tokens.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

AgenticLU teaches one LLM to run a short internal agentic workflow called Chain-of-Clarifications (CoC): generate clarifying questions, point back to paragraph indexes (pointback), and answer the original question. The authors collect CoC traces with a tree search (branching factor 8, depth up to 3), distill those traces via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and produce an AgenticLU-8B model that keeps short-context skills while substantially improving accuracy on 7 long-context tasks up to 128K tokens. Key wins: large gains on multi-hop tasks (e.g., HotpotQA 128K: 71.1% vs 40.0%) and 97.8% recall of correct answers on NarrativeQA during tree search.

Problem Statement

LLMs can accept very long inputs but often fail to find and integrate the few paragraphs needed to answer complex, multi-step questions. Simply increasing token capacity does not ensure the model will 'use' the right parts of the context. The paper asks: can a single model learn to self‑clarify and point back to relevant paragraphs so it reliably retrieves and reasons over long documents?

Main Contribution

Chain-of-Clarifications (CoC): an agentic workflow where the model raises clarification questions, points back to paragraph indexes, answers its clarifications, and then answers the original query.

A test-time tree-search data collection method (branching=8, depth≤3) that produces 107.5K CoC traces from NarrativeQA for finetuning.
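The trace-collection idea above can be sketched as a small breadth-first search. This is a minimal illustration, not the authors' implementation: only the branching factor (8) and depth limit (3) come from the summary. The `Step` record, the `propose` and `is_correct` callbacks, the early stop on solved paths, and the toy proposer are all hypothetical stand-ins for real model and judge calls.

```python
from dataclasses import dataclass
from typing import Callable, List

BRANCHING = 8   # from the summary: branching factor 8
MAX_DEPTH = 3   # from the summary: depth up to 3


@dataclass
class Step:
    question: str          # clarifying question raised by the model
    pointback: List[int]   # paragraph indexes the model cites
    answer: str            # candidate answer to the original query


def collect_traces(
    propose: Callable[[List[Step], int], List[Step]],
    is_correct: Callable[[str], bool],
    branching: int = BRANCHING,
    max_depth: int = MAX_DEPTH,
) -> List[List[Step]]:
    """Expand level by level; keep each path whose latest answer is judged correct."""
    frontier: List[List[Step]] = [[]]
    kept: List[List[Step]] = []
    for _ in range(max_depth):
        next_frontier: List[List[Step]] = []
        for path in frontier:
            for step in propose(path, branching):
                new_path = path + [step]
                if is_correct(step.answer):
                    kept.append(new_path)           # assumption: solved paths are not expanded further
                else:
                    next_frontier.append(new_path)  # unsolved paths get another clarification round
        frontier = next_frontier
    return kept


def max_nodes(branching: int, depth: int) -> int:
    """Upper bound on proposed steps per question when no path stops early."""
    return sum(branching ** level for level in range(1, depth + 1))


def toy_propose(path: List[Step], k: int) -> List[Step]:
    """Hypothetical proposer: only the first candidate at depth 2 is 'right'."""
    depth = len(path) + 1
    return [
        Step(f"q{depth}.{i}", [i], "right" if (depth == 2 and i == 0) else "wrong")
        for i in range(k)
    ]


traces = collect_traces(toy_propose, lambda answer: answer == "right")
print(len(traces), max_nodes(BRANCHING, MAX_DEPTH))
```

With branching 8 and depth 3, the worst case is 8 + 64 + 512 = 584 proposed steps per question, which is why the summary treats trace collection as a one-time, compute-heavy stage.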

Key Findings

AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.

Numbers: HotpotQA 128K: base 40.0% → AgenticLU 71.1% (+31.1 pts).

Practical Use: If you finetune with CoC traces, expect large multi-hop accuracy gains on noisy long documents; apply when retrieval of a few paragraphs matters.

Evidence Ref: Table 10; Table 3

AgenticLU improves average long-context accuracy by ~14.7 percentage points over the base Llama3.1-8B.

Numbers: Long Avg δ = +14.7 (Table 2).

Practical Use: Finetuning with CoC traces yields broad gains across diverse long-context tasks, not just one dataset; use for general long-document QA.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	71.1%	40.0% (Llama3.1-8B)	+31.1 pp	HotpotQA at 128K	AgenticLU-8B achieves 71.1% vs 40.0% for base	Table 10
Accuracy	56.0%	38.0% (Llama3.1-8B)	+18.0 pp	NarrativeQA at 128K	AgenticLU-8B 56% vs base 38%	Table 14
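As a quick check on the Delta column, the rows above can be recomputed directly. The sketch below uses only the values in the table. The relative-improvement figure is an extra derived number that the source does not report, and the helper name `deltas` is illustrative.

```python
# (dataset, AgenticLU-8B accuracy %, Llama3.1-8B baseline accuracy %), copied from the table
rows = [
    ("HotpotQA at 128K", 71.1, 40.0),
    ("NarrativeQA at 128K", 56.0, 38.0),
]


def deltas(value: float, baseline: float):
    """Return (absolute gain in percentage points, relative gain in whole percent)."""
    abs_pp = round(value - baseline, 1)
    rel_pct = round(100 * (value - baseline) / baseline)
    return abs_pp, rel_pct


for name, value, baseline in rows:
    abs_pp, rel_pct = deltas(value, baseline)
    print(f"{name}: +{abs_pp} pp ({rel_pct}% relative to baseline)")
```

Percentage points (pp) measure the absolute gap between two accuracies, while the relative figure divides that gap by the baseline, so the two should not be quoted interchangeably.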

What To Try In 7 Days

Prompt your current LLM to generate one clarifying question per query and ask it to return paragraph indexes (pointback) to see immediate accuracy shifts.

Collect a small set of high-quality CoC traces on your domain and run SFT on an 8B model to test whether one-pass pointback helps retrieval.

Measure effective vs nominal context: add irrelevant tokens and track accuracy drop to quantify your model's effective context window.
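The effective-context measurement in the last item can be prototyped with a small harness like the one below. It is a sketch under stated assumptions: word counts stand in for tokens, exact string match stands in for the accuracy metric, and `answer_fn` is a hypothetical callback wrapping whatever model you are testing. None of these names come from the paper.

```python
import random
from typing import Callable, Dict, List, Sequence


def pad_context(
    relevant: Sequence[str],
    distractors: Sequence[str],
    target_words: int,
    seed: int = 0,
) -> List[str]:
    """Mix the relevant paragraphs with random distractors until target_words is reached."""
    rng = random.Random(seed)
    paragraphs = list(relevant)
    total = sum(len(p.split()) for p in paragraphs)
    pool = list(distractors)
    while total < target_words and pool:
        filler = pool.pop(rng.randrange(len(pool)))
        paragraphs.append(filler)
        total += len(filler.split())
    rng.shuffle(paragraphs)  # avoid always placing the evidence first
    return paragraphs


def accuracy_by_length(
    examples: Sequence[dict],
    distractors: Sequence[str],
    lengths: Sequence[int],
    answer_fn: Callable[[str, List[str]], str],
) -> Dict[int, float]:
    """Accuracy at each padded length; a sharp drop marks the effective context limit."""
    results: Dict[int, float] = {}
    for n in lengths:
        correct = 0
        for ex in examples:
            context = pad_context(ex["relevant"], distractors, n)
            prediction = answer_fn(ex["question"], context)
            correct += int(prediction.strip().lower() == ex["answer"].strip().lower())
        results[n] = correct / len(examples)
    return results


# Toy usage with a fake "model" that simply scans for the evidence paragraph.
examples = [{
    "question": "What is the capital of France?",
    "relevant": ["Paris is the capital of France."],
    "answer": "Paris",
}]
noise = ["noise words here"] * 10
fake_model = lambda q, ctx: "Paris" if any("Paris" in p for p in ctx) else "unknown"
print(accuracy_by_length(examples, noise, [20, 40], fake_model))
```

With a real model, plotting the returned accuracies against length shows where accuracy falls away from the short-context level, which is a practical estimate of the effective window.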

Agent Features

Memory

supports up to 128K token inputs

learns to reference paragraph indexes instead of reading entire doc each time

Planning

Chain-of-Clarifications (CoC)

multi-step clarification planning

Tool Use

pointback (paragraph index retrieval)

GPT-4o as automated judge (in data selection)
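To try pointback with an off-the-shelf model, one option is to number the paragraphs in the prompt and parse the indexes the model returns. The prompt wording, the `Pointback:` output convention, and the function names below are assumptions made for illustration. The summary does not specify the paper's actual prompt or trace format.

```python
import re
from typing import List, Sequence


def number_paragraphs(paragraphs: Sequence[str]) -> str:
    """Prefix each paragraph with a 1-based index the model can cite."""
    return "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))


def build_prompt(paragraphs: Sequence[str], question: str) -> str:
    """Hypothetical prompt asking for a clarifying question, a pointback, and an answer."""
    return (
        f"{number_paragraphs(paragraphs)}\n\n"
        f"Question: {question}\n"
        "First write one clarifying question. Then list the paragraph numbers "
        "that help answer it as 'Pointback: [i], [j]'. "
        "Finally give the answer as 'Answer: ...'."
    )


def parse_pointback(output: str, n_paragraphs: int) -> List[int]:
    """Extract cited indexes, dropping duplicates and out-of-range values."""
    match = re.search(r"Pointback:(.*)", output)
    if not match:
        return []
    cited: List[int] = []
    for raw in re.findall(r"\[(\d+)\]", match.group(1)):
        index = int(raw)
        if 1 <= index <= n_paragraphs and index not in cited:
            cited.append(index)
    return cited


paragraphs = ["Alice wrote the report.", "The report was filed in 2019.", "Bob reviewed it."]
prompt = build_prompt(paragraphs, "Who wrote the report filed in 2019?")
sample_output = "Clarifying question: Which report?\nPointback: [2], [7], [2]\nAnswer: Alice"
print(parse_pointback(sample_output, len(paragraphs)))
```

Validating the parsed indexes against the paragraph count matters in practice, because a model can cite paragraphs that do not exist.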

Frameworks

SFT

DPO

tree-search trace collection

Is Agentic

Yes

Architectures

single LLM agent (no external agents)

tree-search during data collection

Collaboration

single-agent internal planning (no multi-agent coordination)

Optimization Features

Token Efficiency

amortize expensive multi-round search into training so single-pass inference produces pointbacks

Infra Optimization

DeepSpeed across AMD MI250 GPUs

System Optimization

FlashAttention-2

Ring Attention

vLLM

Training Optimization

SFT

direct preference optimization (DPO) on preference pairs
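For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. The sketch below implements the usual DPO loss, which is the negative log-sigmoid of the scaled difference between policy and reference log-ratios for the chosen and rejected responses. The value `beta=0.1` is a common default, not a hyperparameter reported in this summary, and the function name is illustrative.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumption: typical default, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers each response than the frozen reference does.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A policy identical to the reference has zero margin, so the loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Raising the chosen trace's likelihood relative to the rejected one lowers the loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In this setting the chosen and rejected responses would be CoC traces ranked during data selection, so the loss pushes the model toward the clarification paths that led to correct answers.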

Inference Optimization

prefix KV caching to reduce repeated attention cost

distillation of multi-round behavior into one-pass generation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/EvanZhuang/AgenticLU

Data URLs

NarrativeQA (public dataset)

HELMET benchmark (public)

Risks & Boundaries

Limitations

Model uses a fixed maximum number of clarification rounds and cannot yet decide dynamically when to stop.

Data collection requires heavy compute (tree-search across 128K contexts and iterative paragraph relevance checks).

When Not To Use

If you cannot afford the upfront compute to generate CoC traces and run SFT+DPO.

When a dynamic stopping policy for clarifications is required but not implemented.

Failure Modes

Omitting self-clarification or pointback causes large accuracy drops (≥10 pp, Table 4).

Judge bias from GPT-4o used in trace selection could favor certain reasoning styles.

Core Entities

Models

Llama3.1-8B-Instruct

AgenticLU-8B

ProLong-8B

Llama3.1-8B

Metrics

Accuracy

ROUGE-L

runtime overhead

tokens generated

Datasets

NarrativeQA

HotpotQA

Natural Questions

TriviaQA

PopQA

InfiniteBench

HELMET (benchmark)

Benchmarks

HELMET

InfiniteBench

NarrativeQA

Context Entities

Models

Llama3

ProLong-8B-512K

Metrics

Accuracy

Datasets

HELMET

ARC

GSM8K

MMLU

Benchmarks

Short-context tasks (ARC, GSM8K, MMLU)
