Overview
The idea is practical: collect expensive traces once with a tree search, then train a model to perform retrieval and clarification in a single pass. The evidence shows strong, reproducible gains on 7 benchmarks at context lengths up to 128K tokens.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.
Summary TLDR
AgenticLU teaches one LLM to run a short internal agentic workflow called Chain-of-Clarifications (CoC): generate clarifying questions, point back to paragraph indexes (pointback), and answer the original question. The authors collect CoC traces by doing a tree search (branching factor 8, depth up to 3), distill those traces via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and produce an AgenticLU-8B model that keeps short-context skills while substantially improving accuracy on 7 long-context tasks up to 128K tokens. Key wins: large gains on multi-hop tasks (e.g., HotpotQA 128K: 71.1% vs 40.0%), 97.8% recall of correct answers on NarrativeQA during tree search, and
Problem Statement
LLMs can accept very long inputs but often fail to find and integrate the few paragraphs needed to answer complex, multi-step questions. Simply increasing token capacity does not ensure the model will 'use' the right parts of the context. The paper asks: can a single model learn to self‑clarify and point back to relevant paragraphs so it reliably retrieves and reasons over long documents?
Main Contribution
Chain-of-Clarifications (CoC): an agentic workflow in which the model raises clarification questions, points back to the relevant paragraph indexes, answers its own clarifications, and then answers the original query.
A test-time tree-search data collection method (branching=8, depth≤3) that produces 107.5K CoC traces from NarrativeQA for finetuning.
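The tree-search collection step can be pictured with a minimal sketch. It assumes two hypothetical callables that are not from the source: `propose`, standing in for the model generating candidate CoC steps, and `is_correct`, standing in for the answer judge. Only the branching factor (8) and the depth limit (3) come from the summary above; breadth-first expansion and pruning of already-solved paths are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    clarification: str      # self-generated clarifying question
    pointback: List[int]    # paragraph indexes the model cites as evidence
    answer: str             # answer to the original question after this step


Trace = Tuple[Step, ...]


def collect_traces(
    propose: Callable[[Trace, int], List[Step]],  # hypothetical step generator
    is_correct: Callable[[str], bool],            # hypothetical answer judge
    branching: int = 8,
    max_depth: int = 3,
) -> List[Trace]:
    """Expand CoC paths level by level and keep those judged correct."""
    frontier: List[Trace] = [()]
    solved: List[Trace] = []
    for _ in range(max_depth):
        next_frontier: List[Trace] = []
        for trace in frontier:
            # Take at most `branching` candidate steps for this path.
            for step in propose(trace, branching)[:branching]:
                extended = trace + (step,)
                if is_correct(step.answer):
                    solved.append(extended)         # assumption: stop expanding solved paths
                else:
                    next_frontier.append(extended)  # unsolved paths go one level deeper
        frontier = next_frontier
    return solved
```

In a setup like this, the solved traces would become SFT examples, and pairs of solved and unsolved paths could supply DPO preferences. That mapping is an interpretation of the summary, not a detail it states.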
Key Findings
AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.
AgenticLU improves average long-context accuracy by ~14.7 percentage points over the base Llama3.1-8B.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 71.1% | 40.0% (Llama3.1-8B) | +31.1 pp | HotpotQA at 128K | AgenticLU-8B achieves 71.1% vs 40.0% for base | Table 10 |
| Accuracy | 56.0% | 38.0% (Llama3.1-8B) | +18.0 pp | NarrativeQA at 128K | AgenticLU-8B 56% vs base 38% | Table 14 |
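The Delta column is an absolute difference in percentage points, not a relative percentage. A short sketch that recomputes it from the two rows above; the `relative_gain` helper is an added illustration, and the relative figures it produces are not reported in the source.

```python
# (dataset, AgenticLU-8B accuracy %, Llama3.1-8B baseline accuracy %) from the table above
rows = [
    ("HotpotQA at 128K", 71.1, 40.0),
    ("NarrativeQA at 128K", 56.0, 38.0),
]


def delta_pp(value: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Gain as a percentage of the baseline (not a figure from the source)."""
    return 100.0 * (value - baseline) / baseline


for name, value, baseline in rows:
    print(f"{name}: +{delta_pp(value, baseline)} pp, "
          f"{relative_gain(value, baseline):.1f}% relative")
```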
What To Try In 7 Days
Prompt your current LLM to generate one clarifying question per query and ask it to return paragraph indexes (pointback) to see immediate accuracy shifts.
Collect a small set of high-quality CoC traces on your domain and run SFT on an 8B model to test whether one-pass pointback helps retrieval.
Measure effective vs nominal context: add irrelevant tokens and track accuracy drop to quantify your model's effective context window.
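The effective-context check in the last item can be sketched as follows. `ask` is a hypothetical wrapper around whatever model you call, and the example dictionaries use assumed keys (`paragraphs`, `question`, `answer`). Padding is counted in distractor paragraphs as a proxy for irrelevant tokens, and correctness is judged by a simple substring match; both are simplifications.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def build_context(relevant: Sequence[str], distractors: Sequence[str],
                  n_distractors: int, seed: int = 0) -> Tuple[str, List[int]]:
    """Shuffle the relevant paragraphs in among n_distractors irrelevant ones.

    Returns the numbered context and the 1-based indexes of the relevant
    paragraphs, which can double as gold labels for a pointback check.
    """
    rng = random.Random(seed)
    pool = [(p, True) for p in relevant]
    pool += [(distractors[i % len(distractors)], False) for i in range(n_distractors)]
    rng.shuffle(pool)
    numbered = [f"[{i}] {text}" for i, (text, _) in enumerate(pool, start=1)]
    gold = [i for i, (_, is_rel) in enumerate(pool, start=1) if is_rel]
    return "\n\n".join(numbered), gold


def accuracy_by_padding(ask: Callable[[str, str], str],   # hypothetical model call
                        examples: Sequence[dict],
                        distractors: Sequence[str],
                        paddings: Sequence[int]) -> Dict[int, float]:
    """Accuracy at each padding level; examples must be non-empty."""
    results: Dict[int, float] = {}
    for n in paddings:
        hits = 0
        for ex in examples:
            context, _ = build_context(ex["paragraphs"], distractors, n)
            prediction = ask(context, ex["question"])
            # Crude correctness check: the gold answer appears in the prediction.
            hits += int(ex["answer"].lower() in prediction.lower())
        results[n] = hits / len(examples)
    return results
```

Plotting the returned accuracies against padding gives a rough picture of where accuracy starts to fall, which is one way to estimate the effective window relative to the nominal one.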
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Model uses a fixed maximum number of clarification rounds and cannot yet decide dynamically when to stop.
Data collection requires heavy compute (tree-search across 128K contexts and iterative paragraph relevance checks).
When Not To Use
If you cannot afford the upfront compute to generate CoC traces and run SFT+DPO.
If your application requires a dynamic stopping policy for clarifications, which the model does not yet implement.
Failure Modes
Omitting self-clarification or pointback causes large accuracy drops (≥10 pp, Table 4).
Judge bias from GPT-4o used in trace selection could favor certain reasoning styles.

