Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.
Summary TLDR
AgenticLU teaches one LLM to run a short internal agentic workflow called Chain-of-Clarifications (CoC): generate clarifying questions, point back to the relevant paragraph indexes (pointback), and answer the original question. The authors collect CoC traces with a tree search (branching factor 8, depth up to 3), distill those traces via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and produce an AgenticLU-8B model that keeps short-context skills while substantially improving accuracy on 7 long-context tasks up to 128K tokens. Key wins: large gains on multi-hop tasks (e.g., HotpotQA 128K: 71.1% vs 40.0%) and 97.8% recall of correct answers on NarrativeQA during tree search.
Problem Statement
LLMs can accept very long inputs but often fail to find and integrate the few paragraphs needed to answer complex, multi-step questions. Simply increasing token capacity does not ensure the model will 'use' the right parts of the context. The paper asks: can a single model learn to self‑clarify and point back to relevant paragraphs so it reliably retrieves and reasons over long documents?
Main Contribution
Chain-of-Clarifications (CoC): an agentic workflow where the model raises clarification questions, points back to paragraph indexes (pointback), answers its clarifications, and then answers the original query.
A test-time tree-search data collection method (branching=8, depth≤3) that produces 107.5K CoC traces from NarrativeQA for finetuning.
A two-stage distillation recipe: supervised fine-tuning (SFT) on CoC traces followed by Direct Preference Optimization (DPO) using judged preference pairs.
Empirical results across 7 long-context tasks (8K–128K tokens) showing large accuracy gains and modest runtime overhead via prefix caching.
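The tree-search collection step can be sketched as below. This is a minimal illustration, not the authors' code: `propose_steps` and `is_correct` are hypothetical callables standing in for the model's clarification sampler and the answer check, and the choice to stop expanding a path once it reaches a correct answer is an assumption. Only the branching factor (8) and the depth limit (3) come from the summary.

```python
from typing import Callable, List, Tuple

Trace = Tuple[str, ...]  # one CoC path: a sequence of clarification steps


def collect_traces(
    question: str,
    propose_steps: Callable[[str, Trace, int], List[str]],  # hypothetical sampler
    is_correct: Callable[[str, Trace], bool],                # hypothetical answer check
    branching: int = 8,
    max_depth: int = 3,
) -> List[Trace]:
    """Expand the tree level by level and keep paths that reach a correct answer."""
    frontier: List[Trace] = [()]
    solved: List[Trace] = []
    for _ in range(max_depth):
        next_frontier: List[Trace] = []
        for path in frontier:
            # Take at most `branching` candidate next steps for this path.
            for step in propose_steps(question, path, branching)[:branching]:
                candidate = path + (step,)
                if is_correct(question, candidate):
                    solved.append(candidate)  # assumption: solved paths are not expanded further
                else:
                    next_frontier.append(candidate)
        frontier = next_frontier
    return solved
```

With branching 8 and depth 3, a full tree holds at most 8 + 64 + 512 = 584 candidate paths per question, which is why this search is run once for data collection rather than at inference time.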
Key Findings
AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.
AgenticLU improves average long-context accuracy by ~14.7 percentage points over the base Llama3.1-8B.
Tree-search CoC path construction recovers the correct answer for 97.8% of NarrativeQA questions during data collection.
Inference-time runtime overhead is small when prefix KV caching is used.
Most questions resolve with one clarification round; multiple rounds give diminishing but measurable gains.
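The clarify, pointback, answer cycle with a fixed round limit can be approximated by prompting, as in the sketch below. It is a rough illustration under stated assumptions: `generate` is a hypothetical prompt-to-text callable, the prompt wording is invented for this example, and paragraph indexes are assumed to be zero-based. The fine-tuned model described here produces these steps itself rather than through separate calls.

```python
import re
from typing import Callable, List


def chain_of_clarifications(
    paragraphs: List[str],
    question: str,
    generate: Callable[[str], str],  # hypothetical: prompt -> model text
    max_rounds: int = 1,             # fixed limit; one round is the default
) -> str:
    # Number each paragraph so the model can refer to it by index.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    notes: List[str] = []
    for _ in range(max_rounds):
        history = "\n".join(notes)
        clarification = generate(
            f"{context}\n{history}\nQuestion: {question}\n"
            "Ask one clarifying question."
        )
        pointback = generate(
            f"{context}\nClarifying question: {clarification}\n"
            "List the relevant paragraph indexes."
        )
        # Keep only indexes that actually exist in the document.
        indexes = [int(m) for m in re.findall(r"\d+", pointback)]
        cited = "\n".join(paragraphs[i] for i in indexes if 0 <= i < len(paragraphs))
        notes.append(generate(
            f"{cited}\nClarifying question: {clarification}\nAnswer it briefly."
        ))
    return generate(
        f"{context}\nNotes:\n" + "\n".join(notes) + f"\nQuestion: {question}\nAnswer:"
    )
```

Each round costs three calls plus one final call, so raising `max_rounds` is only worthwhile where the extra rounds show a measurable gain.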
Results
Accuracy
NarrativeQA CoC path recall (data collection)
Runtime overhead (inference)
Avg tokens generated per round
Who Should Care
What To Try In 7 Days
Prompt your current LLM to generate one clarifying question per query and ask it to return paragraph indexes (pointback) to see immediate accuracy shifts.
Collect a small set of high-quality CoC traces on your domain and run SFT on an 8B model to test whether one-pass pointback helps retrieval.
Measure effective vs nominal context: add irrelevant tokens and track accuracy drop to quantify your model's effective context window.
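The effective-context measurement in the last item can be sketched as a padding sweep. This is one possible setup, not a procedure from the paper: `answer_fn` is a hypothetical wrapper around your model, the pad sizes are arbitrary, and scoring by case-insensitive substring match is an assumption you may want to replace with your own metric.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy_vs_padding(
    examples: Sequence[Tuple[List[str], str, str]],  # (paragraphs, question, gold answer)
    distractors: Sequence[str],                      # irrelevant paragraphs to mix in
    answer_fn: Callable[[List[str], str], str],      # hypothetical model wrapper
    pad_sizes: Sequence[int] = (0, 10, 100),
    seed: int = 0,
) -> Dict[int, float]:
    """Return accuracy for each amount of irrelevant padding."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in pad_sizes:
        correct = 0
        for paragraphs, question, gold in examples:
            # Mix n distractor paragraphs into the document at random positions.
            padded = paragraphs + [rng.choice(distractors) for _ in range(n)]
            rng.shuffle(padded)
            prediction = answer_fn(padded, question)
            correct += int(gold.lower() in prediction.lower())
        results[n] = correct / len(examples)
    return results
```

The pad size at which accuracy starts to fall gives a rough estimate of the effective context window, which can then be compared with the nominal limit.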
Agent Features
Memory
- supports up to 128K token inputs
- learns to reference paragraph indexes instead of rereading the entire document each time
Planning
- Chain-of-Clarifications (CoC)
- multi-step clarification planning
Tool Use
- pointback (paragraph index retrieval)
- GPT-4o as automated judge (in data selection)
Frameworks
- SFT
- DPO
- tree-search trace collection
Is Agentic
true
Architectures
- single LLM agent (no external agents)
- tree-search during data collection
Collaboration
- single-agent internal planning (no multi-agent coordination)
Optimization Features
Token Efficiency
- amortize expensive multi-round search into training so single-pass inference produces pointbacks
Infra Optimization
- DeepSpeed across AMD MI250 GPUs
System Optimization
- FlashAttention-2
- Ring Attention
- vLLM
Training Optimization
- SFT
- direct preference optimization (DPO) on preference pairs
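One way to turn judged traces into preference pairs for DPO is sketched below. The pairing rule (best-scored trace versus worst-scored trace for the same question) and the `min_margin` parameter are assumptions made for illustration; the summary states only that judged preference pairs were used. The `prompt` / `chosen` / `rejected` keys follow a layout commonly used for preference data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_preference_pairs(
    judged: Iterable[Tuple[str, str, float]],  # (question, trace text, judge score)
    min_margin: float = 0.0,                   # hypothetical: required score gap
) -> List[Dict[str, str]]:
    """Pair the best and worst judged trace for each question."""
    by_question: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for question, trace, score in judged:
        by_question[question].append((score, trace))

    pairs: List[Dict[str, str]] = []
    for question, scored in by_question.items():
        best = max(scored)
        worst = min(scored)
        # Skip questions whose traces are not clearly separated by the judge.
        if best[0] - worst[0] > min_margin:
            pairs.append({"prompt": question, "chosen": best[1], "rejected": worst[1]})
    return pairs
```

Questions with a single trace, or with traces the judge scored equally, yield no pair under this rule.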
Inference Optimization
- prefix KV caching to reduce repeated attention cost
- distillation of multi-round behavior into one-pass generation
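A back-of-the-envelope token count shows why prefix caching keeps multi-step overhead small. The cost model is a deliberate simplification and an assumption of this sketch: every step's prompt is the shared long context plus a short step-specific suffix, and only prefill tokens are counted.

```python
from typing import Sequence


def prefill_cost(
    context_tokens: int,
    suffix_tokens: Sequence[int],  # step-specific tokens appended after the context
    prefix_cached: bool,
) -> int:
    """Count tokens that must be encoded across all steps."""
    if prefix_cached:
        # The shared context is encoded once and its KV entries are reused.
        return context_tokens + sum(suffix_tokens)
    # Without caching, every step re-encodes the full context.
    return sum(context_tokens + s for s in suffix_tokens)
```

For a 128,000-token context and three steps of 200 tokens each (illustrative numbers, not measurements), this gives 128,600 encoded tokens with caching and 384,600 without.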
Reproducibility
Data Urls
- NarrativeQA (public dataset)
- HELMET benchmark (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model uses a fixed maximum number of clarification rounds and cannot yet decide dynamically when to stop.
- Data collection requires heavy compute (tree-search across 128K contexts and iterative paragraph relevance checks).
- The pointback mechanism can fail if the model generates wrong paragraph indexes after finetuning.
When Not To Use
- If you cannot afford the upfront compute to generate CoC traces and run SFT+DPO.
- When a dynamic stopping policy for clarifications is required but not implemented.
- If you need zero dependence on an external judge during data curation (the pipeline used GPT variants for selection).
Failure Modes
- Omitting self-clarification or pointback causes large accuracy drops (≥10 pp, Table 4).
- Judge bias from GPT-4o used in trace selection could favor certain reasoning styles.
- Model may overgenerate paragraph indexes, increasing token output and downstream cost.
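A simple guard against wrong or overgenerated paragraph indexes is to validate the pointback output before using it. The sketch below is a hypothetical post-processing step, not part of the described method: it assumes zero-based indexes written as plain integers in the model output, and the cap of 8 is an arbitrary default.

```python
import re
from typing import List


def sanitize_pointback(raw: str, num_paragraphs: int, max_indexes: int = 8) -> List[int]:
    """Keep unique, in-range paragraph indexes in order, up to a cap."""
    kept: List[int] = []
    for token in re.findall(r"\d+", raw):
        if len(kept) >= max_indexes:
            break  # cap the list to limit downstream token cost
        idx = int(token)
        # Drop indexes that point outside the document and repeated indexes.
        if idx < num_paragraphs and idx not in kept:
            kept.append(idx)
    return kept
```

For example, `sanitize_pointback("Relevant: [3], [3], [99], [0], [5]", 10, max_indexes=2)` returns `[3, 0]`: the repeated 3 and the out-of-range 99 are dropped, and the cap stops the scan before 5.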
Core Entities
Models
- Llama3.1-8B-Instruct
- AgenticLU-8B
- ProLong-8B
- Llama3.1-8B
Metrics
- Accuracy
- ROUGE-L
- runtime overhead
- tokens generated
Datasets
- NarrativeQA
- HotpotQA
- Natural Questions
- TriviaQA
- PopQA
- InfiniteBench
- HELMET
Benchmarks
- HELMET
- InfiniteBench
- NarrativeQA
Context Entities
Models
- Llama3
- ProLong-8B-512K
Metrics
- Accuracy
Datasets
- HELMET
- ARC
- GSM8K
- MMLU
Benchmarks
- Short-context tasks (ARC, GSM8K, MMLU)

