Train a single LLM to ask itself clarifying questions and point to exact paragraphs to solve multi‑step questions in 128K contexts

February 21, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum

Why It Matters For Business

AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.

Summary TLDR

AgenticLU teaches one LLM to run a short internal agentic workflow called Chain-of-Clarifications (CoC): generate clarifying questions, point back to paragraph indexes (pointback), and answer the original question. The authors collect CoC traces by doing a tree search (branching factor 8, depth up to 3), distill those traces via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and produce an AgenticLU-8B model that keeps short-context skills while substantially improving accuracy on 7 long-context tasks up to 128K tokens. Key wins: large gains on multi-hop tasks (e.g., HotpotQA 128K: 71.1% vs 40.0%), 97.8% recall of correct answers on NarrativeQA during tree search, and

Problem Statement

LLMs can accept very long inputs but often fail to find and integrate the few paragraphs needed to answer complex, multi-step questions. Simply increasing token capacity does not ensure the model will 'use' the right parts of the context. The paper asks: can a single model learn to self‑clarify and point back to relevant paragraphs so it reliably retrieves and reasons over long documents?

Main Contribution

Chain-of-Clarifications (CoC): an agentic workflow in which the model raises a clarifying question, points back to the relevant paragraph indexes, answers its own clarification, and then answers the original query.

A test-time tree-search data collection method (branching=8, depth≤3) that produces 107.5K CoC traces from NarrativeQA for finetuning.

A two-stage distillation recipe: supervised fine-tuning (SFT) on CoC traces followed by Direct Preference Optimization (DPO) using judged preference pairs.

Empirical results across 7 long-context tasks (8K–128K tokens) showing large accuracy gains and modest runtime overhead via prefix caching.
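
The Chain-of-Clarifications workflow listed above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable, the prompt wording, the `CoCStep` record, and the 1-based paragraph numbering are all hypothetical. The fixed round count mirrors the limitation noted in this summary that the model does not decide dynamically when to stop.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CoCStep:
    clarification: str     # self-generated clarifying question
    pointback: List[int]   # 1-based paragraph indexes the model cited
    answer: str            # answer to the clarifying question


def chain_of_clarifications(
    question: str,
    paragraphs: List[str],
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> text
    max_rounds: int = 3,             # the summary reports depth up to 3
) -> Tuple[str, List[CoCStep]]:
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))
    steps: List[CoCStep] = []

    def base_prompt() -> str:
        # Long context, original question, then earlier clarification rounds.
        notes = "\n".join(f"Q: {s.clarification}\nA: {s.answer}" for s in steps)
        return f"{numbered}\n\nOriginal question: {question}\n{notes}\n"

    for _ in range(max_rounds):
        clarification = generate(base_prompt() + "Clarifying question:")
        raw = generate(
            base_prompt()
            + f"Clarifying question: {clarification}\nRelevant paragraph indexes:"
        )
        # Keep only indexes that actually exist in the document.
        pointback = [
            int(tok)
            for tok in raw.replace(",", " ").split()
            if tok.isdigit() and 1 <= int(tok) <= len(paragraphs)
        ]
        cited = "\n".join(paragraphs[i - 1] for i in pointback)
        answer = generate(f"{cited}\n\nQuestion: {clarification}\nClarification answer:")
        steps.append(CoCStep(clarification, pointback, answer))

    final = generate(base_prompt() + "Final answer:")
    return final, steps
```

In the distilled model these stages would be produced in a single generation rather than through separate calls; the explicit loop is only meant to make the order of the stages visible.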

Key Findings

AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.

Numbers: HotpotQA 128K: base 40.0% → AgenticLU 71.1% (+31.1 pts).

AgenticLU improves average long-context accuracy by ~14.7 percentage points over the base Llama3.1-8B.

Numbers: Long Avg Δ = +14.7 pp (Table 2).

Tree-search CoC path construction retrieves the correct answer in NarrativeQA with very high recall during data collection.

Numbers: NarrativeQA CoC recall = 97.8% (depth ≤ 3, branching = 8).
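
A breadth-first version of the tree search used for trace collection might look like the sketch below. It assumes two hypothetical callables, `propose` (which samples candidate CoC steps for a partial path) and `is_correct` (a judge for a candidate path), and it stops at the first path judged correct; the summary does not say whether the authors stop early or keep every successful path.

```python
from typing import Callable, List, Optional, Tuple

Path = Tuple[str, ...]  # a sequence of CoC steps, one per depth level


def collect_trace(
    propose: Callable[[Path, int], List[str]],  # hypothetical: (path, k) -> k candidate steps
    is_correct: Callable[[Path], bool],         # hypothetical judge of a candidate path
    branching: int = 8,                         # branching factor from the summary
    max_depth: int = 3,                         # depth limit from the summary
) -> Optional[Path]:
    frontier: List[Path] = [()]
    for _ in range(max_depth):
        next_frontier: List[Path] = []
        for path in frontier:
            for step in propose(path, branching)[:branching]:
                candidate = path + (step,)
                if is_correct(candidate):
                    return candidate  # first path that reaches a correct answer
                next_frontier.append(candidate)
        frontier = next_frontier
    return None  # no path within the depth limit was judged correct
```

With branching 8 and depth 3, an exhaustive search of this shape evaluates at most 8 + 64 + 512 = 584 candidate paths per question, which is one way to see why data collection is compute-heavy.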

Inference-time runtime overhead is small when prefix KV caching is used.

Numbers: Relative runtime 101.93% vs. baseline 100%, i.e. about 1.93% overhead (Table 5).

Most questions resolve with one clarification round; multiple rounds give diminishing but measurable gains.

Numbers: 92% solved in 1 round; a second round solves 53% of the remaining 8%; a third round solves 35% of the remaining ~4% (Section 5.1).
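
The per-round figures above compound as follows. This short calculation only restates the reported percentages; the function name is made up for the illustration.

```python
from typing import List


def cumulative_solve_rate(first_round: float, later_round_fractions: List[float]) -> float:
    """Fraction of questions solved after all rounds.

    Each later round solves a given fraction of whatever is still unsolved.
    """
    solved = first_round
    for fraction in later_round_fractions:
        solved += fraction * (1.0 - solved)
    return solved


# 92% in round 1, then 53% of the remainder, then 35% of the new remainder.
rate = cumulative_solve_rate(0.92, [0.53, 0.35])
print(f"{rate:.4f}")
```

Working it by hand: round two adds 0.53 × 8% = 4.24 points (96.24% total), leaving 3.76%; round three adds 0.35 × 3.76% ≈ 1.32 points, for roughly 97.6% overall. That is close to the 97.8% recall reported for data collection, although the two figures may not be computed on exactly the same basis.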

Results

Accuracy (HotpotQA, 128K)

Value: 71.1%

Baseline: 40.0% (Llama3.1-8B)

Accuracy

Value: 56.0%

Baseline: 38.0% (Llama3.1-8B)

Accuracy (long-context average)

Value: +14.7 pp

Baseline: Llama3.1-8B

NarrativeQA CoC path recall (data collection)

Value: 97.8%

Runtime relative to baseline (inference)

Value: 101.93%

Baseline: 100% (direct answering)

Avg tokens generated per round

Value: 1205.38 (AgenticLU)

Baseline: 76.28

Who Should Care

What To Try In 7 Days

Prompt your current LLM to generate one clarifying question per query and ask it to return paragraph indexes (pointback) to see immediate accuracy shifts.

Collect a small set of high-quality CoC traces on your domain and run SFT on an 8B model to test whether one-pass pointback helps retrieval.

Measure effective vs nominal context: add irrelevant tokens and track accuracy drop to quantify your model's effective context window.
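
The effective-versus-nominal context check in the last item can be run with a small harness like the one below. It is a sketch under assumptions: `answer_fn` stands in for whatever model call you use, the example records use made-up keys (`question`, `paragraphs`, `answer`), and exact string match is a deliberately crude scoring rule.

```python
import random
from typing import Callable, Dict, List


def pad_with_distractors(
    relevant: List[str], distractors: List[str], n_distractors: int, seed: int = 0
) -> List[str]:
    """Mix the gold paragraphs with n irrelevant ones and shuffle the order."""
    rng = random.Random(seed)
    docs = list(relevant) + rng.sample(distractors, n_distractors)
    rng.shuffle(docs)
    return docs


def accuracy_by_padding(
    examples: List[dict],                         # hypothetical record format, see lead-in
    distractors: List[str],                       # pool of irrelevant paragraphs
    answer_fn: Callable[[str, List[str]], str],   # hypothetical model call
    levels: List[int],                            # distractor counts to test
) -> Dict[int, float]:
    results: Dict[int, float] = {}
    for n in levels:
        correct = 0
        for ex in examples:
            context = pad_with_distractors(ex["paragraphs"], distractors, n)
            prediction = answer_fn(ex["question"], context)
            correct += int(prediction.strip().lower() == ex["answer"].strip().lower())
        results[n] = correct / len(examples)
    return results
```

Each level must not exceed the size of the distractor pool, since `random.sample` draws without replacement. The padding level at which accuracy starts to fall gives a rough estimate of the context length the model actually uses.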

Agent Features

Memory

  • supports up to 128K token inputs
  • learns to reference paragraph indexes instead of re-reading the entire document each time

Planning

  • Chain-of-Clarifications (CoC)
  • multi-step clarification planning

Tool Use

  • pointback (paragraph index retrieval)
  • GPT-4o as automated judge (in data selection)
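
Pointback needs paragraphs that carry explicit indexes. One simple way to prepare the input is sketched below; the blank-line paragraph split, the bracketed 1-based numbering, and the prompt wording are assumptions made for illustration, not the format used in the paper.

```python
from typing import List


def split_paragraphs(document: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def build_pointback_prompt(question: str, paragraphs: List[str]) -> str:
    """Number each paragraph so the model can cite it by index."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))
    return (
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "First, write one clarifying question that would help answer the question.\n"
        "Then list the bracketed numbers of the paragraphs that answer it, "
        "for example: Pointback: [2], [5]\n"
    )
```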

Frameworks

  • SFT
  • DPO
  • tree-search trace collection

Is Agentic

true

Architectures

  • single LLM agent (no external agents)
  • tree-search during data collection

Collaboration

  • single-agent internal planning (no multi-agent coordination)

Optimization Features

Token Efficiency

  • amortize expensive multi-round search into training so single-pass inference produces pointbacks

Infra Optimization

  • DeepSpeed across AMD MI250 GPUs

System Optimization

  • FlashAttention-2
  • Ring Attention
  • vLLM

Training Optimization

  • SFT
  • direct preference optimization (DPO) on preference pairs
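
For reference, the standard DPO objective for a single preference pair is sketched below in plain Python. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; `beta=0.1` is a common default and an assumption here, since this summary does not state the value used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred trace.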

Inference Optimization

  • prefix KV caching to reduce repeated attention cost
  • distillation of multi-round behavior into one-pass generation
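
A toy cost model shows why prefix KV caching keeps multi-round prompting cheap on long inputs. It counts only prompt tokens that must be prefilled, assumes every round shares the same long context as its prefix, and ignores decoding cost and the growth of the prompt from earlier rounds; the function and its parameters are invented for this illustration.

```python
from typing import List


def prefill_tokens(
    context_tokens: int,
    round_prompt_tokens: List[int],  # extra prompt tokens appended in each round
    use_prefix_cache: bool,
) -> int:
    """Total prompt tokens the model must process across all rounds."""
    total = 0
    for i, extra in enumerate(round_prompt_tokens):
        if use_prefix_cache and i > 0:
            total += extra                   # shared context is served from the KV cache
        else:
            total += context_tokens + extra  # context is encoded from scratch
    return total


without_cache = prefill_tokens(128_000, [50, 200, 200], use_prefix_cache=False)
with_cache = prefill_tokens(128_000, [50, 200, 200], use_prefix_cache=True)
print(without_cache, with_cache)
```

Under these assumptions three rounds over a 128K-token context cost 384,450 prefill tokens without caching and 128,450 with it, so the extra rounds add well under 1% to prefill work.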

Reproducibility

Data Urls

  • NarrativeQA (public dataset)
  • HELMET benchmark (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model uses a fixed maximum number of clarification rounds and cannot yet decide dynamically when to stop.
  • Data collection requires heavy compute (tree-search across 128K contexts and iterative paragraph relevance checks).
  • The pointback mechanism can fail if the model generates wrong paragraph indexes after finetuning.

When Not To Use

  • If you cannot afford the upfront compute to generate CoC traces and run SFT+DPO.
  • When a dynamic stopping policy for clarifications is required but not implemented.
  • If you need zero dependence on an external judge during data curation (the pipeline used GPT variants for selection).

Failure Modes

  • Omitting self-clarification or pointback causes large accuracy drops (≥10 pp, Table 4).
  • Judge bias from GPT-4o used in trace selection could favor certain reasoning styles.
  • Model may overgenerate paragraph indexes, increasing token output and downstream cost.
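
Two of the pointback risks noted in this section, wrong indexes and overgenerated indexes, can be partly contained with a validation step on the model's output. The parser below is a hedged sketch: the function name, the 1-based numbering, and the cap of 8 indexes are assumptions, not details from the paper.

```python
import re
from typing import List


def parse_pointback(raw: str, num_paragraphs: int, max_indexes: int = 8) -> List[int]:
    """Extract valid, de-duplicated 1-based paragraph indexes from model output."""
    kept: List[int] = []
    for token in re.findall(r"\d+", raw):
        index = int(token)
        # Drop indexes that do not exist in the document, and repeats.
        if 1 <= index <= num_paragraphs and index not in kept:
            kept.append(index)
        if len(kept) == max_indexes:
            break  # cap the list to bound downstream token cost
    return kept
```

For example, `parse_pointback("[3], [3], [99], [1]", num_paragraphs=10)` keeps `[3, 1]`: the repeat and the out-of-range index are discarded. This cannot detect an index that is in range but points to the wrong paragraph.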

Core Entities

Models

  • Llama3.1-8B-Instruct
  • AgenticLU-8B
  • ProLong-8B
  • Llama3.1-8B

Metrics

  • Accuracy
  • ROUGE-L
  • runtime overhead
  • tokens generated

Datasets

  • NarrativeQA
  • HotpotQA
  • Natural Questions
  • TriviaQA
  • PopQA
  • InfiniteBench
  • HELMET

Benchmarks

  • HELMET
  • InfiniteBench
  • NarrativeQA

Context Entities

Models

  • Llama3
  • ProLong-8B-512K

Metrics

  • Accuracy

Datasets

  • HELMET
  • ARC
  • GSM8K
  • MMLU

Benchmarks

  • Short-context tasks (ARC, GSM8K, MMLU)