Train a single LLM to ask itself clarifying questions and point to exact paragraphs to solve multi‑step questions in 128K contexts

February 21, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The idea is practical: collect expensive traces once with a tree search, then train a model to do the retrieval+clarification in one pass; evidence shows strong, reproducible gains on 7 benchmarks up to 128K tokens.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.

Who Should Care

Summary TLDR

AgenticLU teaches one LLM to run a short internal agentic workflow called Chain-of-Clarifications (CoC): generate clarifying questions, point back to paragraph indexes (pointback), and answer the original question. The authors collect CoC traces by doing a tree search (branching factor 8, depth up to 3), distill those traces via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and produce an AgenticLU-8B model that keeps short-context skills while substantially improving accuracy on 7 long-context tasks up to 128K tokens. Key wins: large gains on multi-hop tasks (e.g., HotpotQA 128K: 71.1% vs 40.0%), 97.8% recall of correct answers on NarrativeQA during tree search, and

Problem Statement

LLMs can accept very long inputs but often fail to find and integrate the few paragraphs needed to answer complex, multi-step questions. Simply increasing token capacity does not ensure the model will 'use' the right parts of the context. The paper asks: can a single model learn to self‑clarify and point back to relevant paragraphs so it reliably retrieves and reasons over long documents?

Main Contribution

Chain-of-Clarifications (CoC): an agentic workflow in which the model raises clarifying questions, points back to the relevant paragraph indexes (pointback), answers its own clarifications, and then answers the original query.

A test-time tree-search data collection method (branching=8, depth≤3) that produces 107.5K CoC traces from NarrativeQA for finetuning.
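The tree-search collection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the branching factor (8) and depth limit (3) come from the summary. The function names `propose`, `answer`, and `is_correct`, the breadth-first order, and the choice to stop expanding a branch once it yields a correct answer are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

BRANCHING = 8   # from the summary: branching factor 8
MAX_DEPTH = 3   # from the summary: depth up to 3


@dataclass
class Trace:
    steps: List[str] = field(default_factory=list)  # clarification steps so far
    answer: str = ""


def worst_case_expansions(branching: int = BRANCHING, depth: int = MAX_DEPTH) -> int:
    """Number of nodes expanded if no branch is ever judged correct."""
    return sum(branching ** d for d in range(1, depth + 1))


def collect_traces(
    question: str,
    propose: Callable[[str, List[str], int], List[str]],  # hypothetical: sample next steps
    answer: Callable[[str, List[str]], str],               # hypothetical: answer given steps
    is_correct: Callable[[str], bool],                     # hypothetical: judge or gold check
    branching: int = BRANCHING,
    max_depth: int = MAX_DEPTH,
) -> List[Trace]:
    """Breadth-first expansion that keeps every trace judged correct."""
    frontier = [Trace()]
    kept: List[Trace] = []
    for _ in range(max_depth):
        next_frontier = []
        for trace in frontier:
            for step in propose(question, trace.steps, branching):
                child = Trace(steps=trace.steps + [step])
                child.answer = answer(question, child.steps)
                if is_correct(child.answer):
                    kept.append(child)           # assumption: solved branches stop here
                else:
                    next_frontier.append(child)  # unsolved branches go one level deeper
        frontier = next_frontier
    return kept
```

Under these assumptions a single question can cost up to 8 + 64 + 512 = 584 model expansions, which is why the search is run once at data-collection time and then distilled into the model.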

Key Findings

AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.

Numbers: HotpotQA 128K: base 40.0% → AgenticLU 71.1% (+31.1 pts).

Practical Use: If you finetune with CoC traces, expect large multi-hop accuracy gains on noisy long documents; apply when retrieval of a few paragraphs matters.

Evidence Ref: Table 10; Table 3

AgenticLU improves average long-context accuracy by ~14.7 percentage points over the base Llama3.1-8B.

Numbers: Long Avg δ = +14.7 (Table 2).

Practical Use: Finetuning with CoC traces yields broad gains across diverse long-context tasks, not just one dataset; use it for general long-document QA.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 71.1% | 40.0% (Llama3.1-8B) | +31.1 pp | HotpotQA at 128K | AgenticLU-8B achieves 71.1% vs 40.0% for base | Table 10
Accuracy | 56.0% | 38.0% (Llama3.1-8B) | +18.0 pp | NarrativeQA at 128K | AgenticLU-8B 56% vs base 38% | Table 14
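The deltas in the results above are absolute percentage-point differences, not relative improvements. The short check below recomputes them from the reported values. The helper names are made up for this example, and the relative-gain figure is derived here rather than reported in the source.

```python
# Reported (value, baseline) accuracy pairs, in percent, from the results above.
results = {
    "HotpotQA at 128K": (71.1, 40.0),
    "NarrativeQA at 128K": (56.0, 38.0),
}


def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (value - baseline) / baseline * 100


deltas = {name: delta_pp(v, b) for name, (v, b) in results.items()}
print(deltas)
```

For HotpotQA the +31.1 pp delta corresponds to a relative gain of roughly 78% over the base model, so it matters which of the two figures is being quoted.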

What To Try In 7 Days

Prompt your current LLM to generate one clarifying question per query and ask it to return paragraph indexes (pointback) to see immediate accuracy shifts.

Collect a small set of high-quality CoC traces on your domain and run SFT on an 8B model to test whether one-pass pointback helps retrieval.

Measure effective vs nominal context: add irrelevant tokens and track accuracy drop to quantify your model's effective context window.
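The effective-versus-nominal context measurement in the last item can be sketched like this. It is an illustration under stated assumptions: document length is counted in paragraphs rather than tokens, scoring is case-insensitive exact match, and `answer_fn` is a hypothetical wrapper around whatever model you are testing.

```python
import random
from typing import Callable, Dict, List


def pad_with_distractors(
    relevant: List[str], distractors: List[str], target_paragraphs: int, seed: int = 0
) -> List[str]:
    """Mix the relevant paragraphs into a shuffled pool of irrelevant ones."""
    rng = random.Random(seed)
    n_fill = max(0, target_paragraphs - len(relevant))
    filler = [rng.choice(distractors) for _ in range(n_fill)]
    doc = list(relevant) + filler
    rng.shuffle(doc)
    return doc


def accuracy_by_length(
    examples: List[dict],
    distractors: List[str],
    lengths: List[int],
    answer_fn: Callable[[str, List[str]], str],  # hypothetical model wrapper
) -> Dict[int, float]:
    """Return {document length in paragraphs: exact-match accuracy}."""
    scores = {}
    for n in lengths:
        correct = 0
        for ex in examples:
            doc = pad_with_distractors(ex["relevant"], distractors, n)
            pred = answer_fn(ex["question"], doc)
            # Assumption: case-insensitive exact match as the scoring rule.
            correct += int(pred.strip().lower() == ex["answer"].strip().lower())
        scores[n] = correct / len(examples)
    return scores
```

Plotting the returned scores against length gives a rough picture of where accuracy starts to fall off, which is one way to estimate the effective context window.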

Agent Features

Memory
supports up to 128K token inputs; learns to reference paragraph indexes instead of reading entire doc each time
Planning
Chain-of-Clarifications (CoC); multi-step clarification planning
Tool Use
pointback (paragraph index retrieval); GPT-4o as automated judge (in data selection)
Frameworks
SFT; DPO; tree-search trace collection
Is Agentic

Yes

Architectures
single LLM agent (no external agents); tree-search during data collection
Collaboration
single-agent internal planning (no multi-agent coordination)
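The pointback feature listed under Tool Use above can be approximated with plain prompting. The sketch below numbers the paragraphs, asks for a clarifying question plus a `Pointback:` line, and parses the indexes back out. The prompt wording and the `Pointback: [3], [7]` output format are assumptions made for this example, not the format used in the paper.

```python
import re
from typing import List, Tuple


def number_paragraphs(document: str) -> Tuple[List[str], str]:
    """Split on blank lines and prefix each paragraph with a 1-based index."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))
    return paragraphs, numbered


def build_clarification_prompt(numbered_doc: str, question: str) -> str:
    """Hypothetical prompt asking for one clarifying question and a pointback line."""
    return (
        f"{numbered_doc}\n\n"
        f"Question: {question}\n"
        "First write one clarifying question that would help answer the question.\n"
        "Then list the relevant paragraph numbers, formatted as: Pointback: [3], [7]\n"
    )


def parse_pointback(model_output: str, n_paragraphs: int) -> List[int]:
    """Extract valid, de-duplicated paragraph indexes from the Pointback line."""
    match = re.search(r"Pointback:(.*)", model_output)
    if not match:
        return []
    indexes: List[int] = []
    for token in re.findall(r"\[(\d+)\]", match.group(1)):
        idx = int(token)
        # Drop indexes that fall outside the document or were already listed.
        if 1 <= idx <= n_paragraphs and idx not in indexes:
            indexes.append(idx)
    return indexes
```

Validating the indexes against the paragraph count is a cheap guard against the model citing paragraphs that do not exist.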

Optimization Features

Token Efficiency
amortize expensive multi-round search into training so single-pass inference produces pointbacks
Infra Optimization
DeepSpeed across AMD MI250 GPUs
System Optimization
FlashAttention-2; Ring Attention; vLLM
Training Optimization
SFT; direct preference optimization (DPO) on preference pairs
Inference Optimization
prefix KV caching to reduce repeated attention cost; distillation of multi-round behavior into one-pass generation
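For readers unfamiliar with the DPO step listed under Training Optimization, the standard per-pair DPO loss can be written in a few lines. This sketch uses scalar sequence log-probabilities and plain Python so it runs without a deep-learning framework. The value `beta=0.1` is a placeholder chosen for the example and is not taken from the source.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: placeholder value, not from the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss equals log 2 when the policy matches the reference and decreases as the policy shifts probability toward the preferred trace relative to the rejected one.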

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

NarrativeQA (public dataset); HELMET benchmark (public)

Risks & Boundaries

Limitations

Model uses a fixed maximum number of clarification rounds and cannot yet decide dynamically when to stop.

Data collection requires heavy compute (tree-search across 128K contexts and iterative paragraph relevance checks).

When Not To Use

If you cannot afford the upfront compute to generate CoC traces and run SFT+DPO.

If you need the model to decide dynamically when to stop asking clarifying questions; this is not yet implemented.

Failure Modes

Omitting self-clarification or pointback causes large accuracy drops (≥10 pp, Table 4).

Judge bias from GPT-4o used in trace selection could favor certain reasoning styles.

Core Entities

Models

Llama3.1-8B-Instruct, AgenticLU-8B, ProLong-8B, Llama3.1-8B

Metrics

Accuracy, ROUGE-L, runtime overhead, tokens generated

Datasets

NarrativeQA, HotpotQA, Natural Questions, TriviaQA, PopQA, InfiniteBench, HELMET benchmark

Benchmarks

HELMET, InfiniteBench, NarrativeQA

Context Entities

Models

Llama3, ProLong-8B-512K

Metrics

Accuracy

Datasets

HELMET, ARC, GSM8K, MMLU

Benchmarks

Short-context tasks (ARC, GSM8K, MMLU)