VERDICT: unify diversification and verification to produce grounded clarifications in RAG

February 14, 2025 | 7 min

Overview

Decision Snapshot: Ready For Pilot

Method reduces retrieval and large-context LLM calls and ties each interpretation to a supporting passage; validated on ASQA and ablated across model sizes, but depends on retriever and LLM quality.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He

Why It Matters For Business

Cuts repeated search calls and produces clarifications you can cite, improving enterprise search accuracy and user trust while lowering retrieval cost.

Summary TLDR

The paper introduces VERDICT, a pipeline that integrates retrieval feedback and LLM execution feedback to generate disambiguated, verifiable sub-questions for vague queries. Instead of first generating multiple interpretations and then verifying them, VERDICT retrieves a high-recall set of passages once, asks the LLM to propose only interpretations that each passage can answer, and clusters the results to remove noise. On the ASQA benchmark (Wikipedia corpus), VERDICT raises grounded F1 (G-F1) by ~23% on average over strong Diversify-then-Verify baselines and reduces redundant retrieval and long-context LLM calls.

Problem Statement

Ambiguous short queries cause RAG systems to generate many ungrounded interpretations. The common Diversify-then-Verify (DtV) pipeline first has the LLM generate interpretations and then retrieves passages for each one, which causes wasted retrieval, noisy passages, cascading errors, and high verification cost. The paper aims to produce diverse interpretations that are grounded (verifiable) while cutting retrieval and verification overhead.

Main Contribution

VERDICT: a unified 'verified diversification' pipeline that conditions interpretation generation on a single retrieved set of passages (the retrieved 'universe') and on execution feedback from the generator.

Consolidation: cluster-based aggregation of per-passage interpretation-answer pairs to denoise and deduplicate without extra retrieval.

Key Findings

VERDICT yields large average grounded-F1 gains vs strong baselines.

Numbers: avg +23% G-F1 across backbone LLMs

Practical Use: Use VERDICT to raise verifiable disambiguation quality when you care about grounded answers.

Evidence Ref: Abstract; Sec. 1; Sec. 5

VERDICT needs only one retriever call per query instead of one per interpretation.

Numbers: retriever calls: O(1) vs O(|interpretations|); k tuned to 20

Practical Use: Expect lower retrieval cost and fewer search queries; scale retrieval once per user query.

Evidence Ref: Sec. 4.1; Table 1
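
The call-count difference in this finding can be made concrete with a small sketch. It only encodes what the summary states (one retrieval per interpretation for DtV, one retrieval per query plus one short LLM call per retrieved passage for VERDICT). The names `CallBudget`, `dtv_retriever_calls`, and `verdict_budget` are hypothetical, and DtV's LLM call count is left out because the summary does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallBudget:
    """Per-query call counts (illustrative only, not taken from the paper's code)."""
    retriever_calls: int
    llm_calls: int


def dtv_retriever_calls(num_interpretations: int) -> int:
    """Diversify-then-Verify: one retrieval per generated interpretation."""
    return num_interpretations


def verdict_budget(k: int = 20) -> CallBudget:
    """VERDICT-style budget: one retrieval, then one short LLM call per passage.

    k=20 mirrors the tuned top-k quoted above.
    """
    return CallBudget(retriever_calls=1, llm_calls=k)


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(
            f"{n} interpretations -> DtV retrievals: {dtv_retriever_calls(n)}, "
            f"VERDICT retrievals: {verdict_budget().retriever_calls}"
        )
```

Retrieval cost stays flat as the number of interpretations grows, while the per-passage LLM calls scale with k rather than with the number of interpretations.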

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| G-F1 (grounded F1) on ASQA | 70.82 (VERDICT, GPT-4o) | 41.79 (DtV, GPT-4o) | +29.03 pts | ASQA dev | Table 2 (GPT-4o rows) | Table 2 |
| G-F1 (grounded F1) on ASQA | 67.80 (VERDICT, LLaMA 70B) | 48.00 (DtV, LLaMA 70B) | +19.80 pts | ASQA dev | Table 2 (LLaMA 70B rows) | Table 2 |

What To Try In 7 Days

Implement a single high-recall retrieval pass (top-k, k≈20) and run the LLM once per passage to extract question-answer pairs.

Cluster the per-passage Q&A embeddings (HDBSCAN) and pick medoids to deduplicate clarifications.

Tune clustering to reach a business-preferred precision/recall trade-off (conservative vs default).
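
The three steps above can be sketched end to end as follows. This is a minimal, dependency-free illustration, not the authors' implementation: `retrieve`, `extract_qa`, and `embed` are hypothetical callables you would back with your own retriever, LLM, and embedding encoder, and a simple greedy cosine-threshold grouping stands in for HDBSCAN so the example runs without extra libraries. The `threshold=0.8` default is an arbitrary placeholder for the precision/recall knob.

```python
import math
from typing import Callable, Optional

Vector = list[float]
QAPair = tuple[str, str]  # (interpretation question, answer grounded in one passage)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def greedy_clusters(vectors: list[Vector], threshold: float) -> list[list[int]]:
    """Stand-in for HDBSCAN: join the first cluster whose seed is similar enough."""
    clusters: list[list[int]] = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(v, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def medoid(members: list[int], vectors: list[Vector]) -> int:
    """Member with the highest total similarity to its cluster (ties: first)."""
    return max(members, key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in members))


def verdict_sketch(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    extract_qa: Callable[[str, str], Optional[QAPair]],
    embed: Callable[[str], Vector],
    k: int = 20,
    threshold: float = 0.8,
) -> list[QAPair]:
    passages = retrieve(query, k)  # step 1: single high-recall retrieval
    pairs = []
    for passage in passages:  # step 2: one short LLM call per passage
        qa = extract_qa(query, passage)
        if qa is not None:  # None means the passage cannot answer the query
            pairs.append(qa)
    if not pairs:
        return []
    vectors = [embed(f"{q} {a}") for q, a in pairs]
    # step 3: cluster the Q&A embeddings and keep one medoid per cluster
    return [pairs[medoid(c, vectors)] for c in greedy_clusters(vectors, threshold)]


# Toy stubs so the sketch runs; real systems replace all three.
_TOY_QA = {
    "film-a": ("When was the film released?", "1999"),
    "film-b": ("When did the film come out?", "1999"),
    "book": ("When was the book published?", "1985"),
}


def toy_retrieve(query: str, k: int) -> list[str]:
    return ["film-a", "film-b", "book", "noise"][:k]


def toy_extract_qa(query: str, passage: str) -> Optional[QAPair]:
    return _TOY_QA.get(passage)


def toy_embed(text: str) -> Vector:
    return [1.0, 0.0] if "film" in text else [0.0, 1.0]


result = verdict_sketch("When was it released?", toy_retrieve, toy_extract_qa, toy_embed)
```

In the toy run, the two film passages collapse into one clarification and the unanswerable passage is dropped. Raising `threshold` keeps more near-duplicates apart, and lowering it merges more aggressively, which is the kind of trade-off the tuning step refers to.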

Agent Features

Tool Use
retriever, LLM generator, embedding encoder
Frameworks
VERDICT
Is Agentic

Yes

Architectures
single-agent pipeline

Optimization Features

Token Efficiency
avoid long multi-passage LLM inputs
Infra Optimization
reduced retriever queries; fewer large-model long-context runs
System Optimization
use embedding encoder reused from retrieval for clustering
Inference Optimization
single retrieval per query; per-passage LLM calls to minimize context length; parallelizable passage-level LLM inferences
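
Because each passage is processed independently, the per-passage LLM calls can be issued concurrently. The sketch below shows one way to do that with the standard library. `call_llm` is a hypothetical function standing in for whatever client you use, and `max_workers=8` is an arbitrary value that should respect your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def run_per_passage(
    query: str,
    passages: list[str],
    call_llm: Callable[[str, str], T],
    max_workers: int = 8,
) -> list[T]:
    """Run one (query, passage) LLM call per passage in parallel threads.

    Executor.map returns results in the same order as `passages`,
    so outputs can be zipped back to their source passage.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda passage: call_llm(query, passage), passages))


def fake_llm(query: str, passage: str) -> str:
    # Hypothetical stand-in for a network-bound LLM request.
    return f"{query} | {passage}"


outputs = run_per_passage("who won?", ["p1", "p2", "p3"], fake_llm)
```

Threads suit this workload because the calls are I/O-bound; an async client would be an equally reasonable choice.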

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Method assumes a reasonably good retriever; poor retrieval coverage breaks verified diversification.

Not evaluated under adversarial or extremely degraded retriever/generator behaviors.

When Not To Use

If your retriever cannot return relevant passages in top-k for typical queries.

When you cannot afford many per-passage LLM calls (compute-constrained environments).
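
The first condition can be checked before a pilot by measuring recall@k on a small labeled set of queries. The sketch below is a generic recall@k helper, not a metric from the paper; the passage IDs and the gold set are made-up examples.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold passages that appear in the top-k retrieved list."""
    if not gold_ids:
        raise ValueError("gold_ids must contain at least one passage id")
    top_k = set(retrieved_ids[:k])
    return len(gold_ids & top_k) / len(gold_ids)


# Illustrative data: ranked retriever output and the passages a human marked relevant.
ranked = ["a", "b", "c", "d"]
gold = {"b", "d", "z"}

recall_top2 = recall_at_k(ranked, gold, k=2)  # only "b" is in the top 2
recall_top4 = recall_at_k(ranked, gold, k=4)  # "b" and "d" found, "z" never retrieved
```

A gold passage that never enters the top-k (like "z" here) corresponds to an interpretation the pipeline cannot surface, so low recall at your chosen k is a signal to fix retrieval first.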

Failure Modes

Missing interpretation if correct passage is outside top-k retrieval.

Smaller LLMs struggle to detect unanswerable passages (higher interpretation error rate for 8B models).

Core Entities

Models

LLaMA 3.1 8B, LLaMA 3.3 70B, GPT-4o, gte-Qwen2-7B-instruct, arctic-embed

Metrics

G-Precision, G-Recall, G-F1, Ungrounded Recall, Average # interpretations (|Q|)

Datasets

ASQA, AmbigNQ, Wikipedia (corpus used in ASQA)

Benchmarks

ASQA