Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Cuts repeated search calls and produces clarifications you can cite, improving enterprise search accuracy and user trust while lowering retrieval cost.
Summary TLDR
The paper introduces VERDICT, a pipeline that integrates retrieval feedback and LLM execution feedback to generate disambiguated, verifiable sub-questions for vague queries. Instead of first making multiple interpretations and then verifying them, VERDICT retrieves a high-recall set of passages once, asks the LLM to propose only interpretations that each passage can answer, and clusters results to remove noise. On the ASQA benchmark (Wikipedia) VERDICT raises grounded F1 by ~23% on average versus strong Diversify-then-Verify baselines and reduces redundant retrieval and long-context LLM calls.
Problem Statement
Ambiguous short queries cause RAG systems to generate many ungrounded interpretations. The common Diversify-then-Verify (DtV) pipeline first creates interpretations from the LLM and then retrieves for each, causing wasted retrieval, noisy passages, cascading errors, and high verification cost. The paper aims to produce diverse interpretations that are grounded (verifiable) while cutting retrieval and verification overhead.
Main Contribution
VERDICT: a unified 'verified diversification' pipeline that conditions interpretation generation on a single retrieved passage set (the 'universe') and on generator execution feedback.
Consolidation: cluster-based aggregation of per-passage interpretation-answer pairs to denoise and deduplicate without extra retrieval.
Empirical gains on ASQA: improves grounded F1 (verifiability) substantially and reduces retrieval/LLM costs compared to DtV baselines; code and evaluation framework released.
Key Findings
VERDICT raises grounded F1 by ~23% on average over strong Diversify-then-Verify baselines on ASQA.
VERDICT needs only one retriever call per query instead of one per interpretation.
VERDICT produces higher grounded precision and recall on ASQA.
VERDICT matches or exceeds human-level diversity while staying grounded.
Clustering (consolidation) controls precision/recall trade-off.
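The retriever-call finding above can be made concrete with a small counting sketch. It assumes only what the summary states: a Diversify-then-Verify pipeline retrieves once per interpretation, while VERDICT retrieves once per query. The function name and the example numbers are hypothetical and are not figures from the paper.

```python
def retriever_calls(num_queries: int, interpretations_per_query: int, pipeline: str) -> int:
    """Count retriever calls under the simplified model described in the summary."""
    if pipeline == "dtv":
        # Diversify-then-Verify: one retrieval for every generated interpretation.
        return num_queries * interpretations_per_query
    if pipeline == "verdict":
        # VERDICT: a single high-recall retrieval per query.
        return num_queries
    raise ValueError(f"unknown pipeline: {pipeline!r}")


# Illustrative numbers only: 1,000 queries with 4 interpretations each.
dtv_calls = retriever_calls(1000, 4, "dtv")
verdict_calls = retriever_calls(1000, 4, "verdict")
print(dtv_calls, verdict_calls)
```

This model ignores LLM calls, which VERDICT issues once per retrieved passage, so it should not be read as a total cost estimate.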
Results
G-F1 (grounded F1) on ASQA
Average # interpretations per query
Who Should Care
What To Try In 7 Days
Implement a single high-recall retrieval pass (top-k, k≈20) and run the LLM once per passage to extract question-answer pairs.
Cluster the per-passage Q&A embeddings (HDBSCAN) and pick medoids to deduplicate clarifications.
Tune clustering to reach a business-preferred precision/recall trade-off (conservative vs default).
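The deduplication step above can be sketched as medoid selection over cluster labels. This is a minimal sketch, not the paper's implementation: it assumes the cluster labels were already produced elsewhere (for example by an HDBSCAN implementation, which marks noise points with the label -1), that cosine distance is the metric, and that all embeddings are non-zero vectors. The names `cosine_distance` and `pick_medoids` and the toy 2-D embeddings are hypothetical.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pick_medoids(embeddings, labels):
    """Map each cluster label to the index of its medoid.

    The medoid is the member with the smallest total distance to the other
    members of its cluster. Points labelled -1 (noise) are dropped.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            clusters[label].append(idx)

    medoids = {}
    for label, members in clusters.items():
        medoids[label] = min(
            members,
            key=lambda i: sum(
                cosine_distance(embeddings[i], embeddings[j]) for j in members
            ),
        )
    return medoids


# Toy embeddings for seven per-passage Q&A pairs (assumed, for illustration).
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],   # cluster 0
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],   # cluster 1
    [0.5, 0.5],                           # noise
]
labels = [0, 0, 0, 1, 1, 1, -1]
print(pick_medoids(embeddings, labels))
```

Keeping one medoid per cluster yields one clarification per distinct intent; how aggressively points are grouped (and how many are discarded as noise) is where the precision/recall tuning happens.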
Agent Features
Tool Use
- retriever
- LLM generator
- embedding encoder
Frameworks
- VERDICT
Is Agentic
true
Architectures
- single-agent pipeline
Optimization Features
Token Efficiency
- avoid long multi-passage LLM inputs
Infra Optimization
- reduced retriever queries; fewer large-model long-context runs
System Optimization
- reuse the retrieval embedding encoder for clustering
Inference Optimization
- single retrieval per query
- per-passage LLM calls to minimize context length
- parallelizable passage-level LLM inferences
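The passage-level parallelism listed above can be sketched with a thread pool. This is a sketch under stated assumptions: `extract_qa` is a hypothetical stand-in for one short-context LLM call and uses a trivial substring check instead of a model, and returning `None` stands for "this passage cannot answer any interpretation of the query".

```python
from concurrent.futures import ThreadPoolExecutor


def extract_qa(query, passage):
    """Hypothetical stand-in for a single per-passage LLM call.

    A real version would prompt a model with the query and this one passage.
    Here a substring match fakes the answerability decision.
    """
    if query.lower() not in passage.lower():
        return None  # passage judged unable to answer the query
    return {"question": f"{query} (as described in: {passage})", "passage": passage}


def run_per_passage(query, passages, max_workers=8):
    """Run one short-context call per passage in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda p: extract_qa(query, p), passages))
    # Drop passages that produced no grounded interpretation.
    return [r for r in results if r is not None]


passages = [
    "Apple Inc. was founded in 1976.",
    "The apple is a fruit of the apple tree.",
    "Bananas are yellow.",
]
for pair in run_per_passage("apple", passages):
    print(pair["question"])
```

Threads suit this pattern when each call is a network request to a model endpoint, since the work is I/O-bound; each call sees only one passage, which keeps the context short.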
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Method assumes a reasonably good retriever; poor retrieval coverage breaks verified diversification.
- Not evaluated under adversarial or extremely degraded retriever/generator behaviors.
- Main benchmark is ASQA (Wikipedia); enterprise corpora may need retriever tuning and prompt adjustments.
When Not To Use
- If your retriever cannot return relevant passages in top-k for typical queries.
- When you cannot afford many per-passage LLM calls (compute-constrained environments).
- If adversarial or highly noisy corpora make consolidation unreliable.
Failure Modes
- Missing interpretation if correct passage is outside top-k retrieval.
- Smaller LLMs struggle to detect unanswerable passages (higher interpretation error rate for 8B models).
- Clustering may merge distinct intents if embeddings fail to separate them.
Core Entities
Models
- LLaMA 3.1 8B
- LLaMA 3.3 70B
- GPT-4o
- gte-Qwen2-7B-instruct
- arctic-embed
Metrics
- G-Precision
- G-Recall
- G-F1
- Ungrounded Recall
- Average # interpretations (|Q|)
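As a reading aid for the metrics above, the sketch below combines a precision and a recall value into an F1 score using the standard harmonic mean. It assumes G-F1 is derived from G-Precision and G-Recall in this usual way; the summary does not spell out the paper's exact definition of groundedness, so the inputs are treated as already computed. The function name and example values are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not results from the paper.
print(round(f1_score(0.8, 0.5), 4))
```

Because the harmonic mean is pulled toward the smaller value, a pipeline cannot score well by emitting many interpretations (high recall) if most of them are ungrounded (low precision).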
Datasets
- ASQA
- AmbigNQ
- Wikipedia (corpus used in ASQA)
Benchmarks
- ASQA

