VERDICT: unify diversification and verification to produce grounded clarifications in RAG

Overview

Decision Snapshot: Ready For Pilot

Method reduces retrieval and large-context LLM calls and ties each interpretation to a supporting passage; validated on ASQA and ablated across model sizes, but depends on retriever and LLM quality.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He

Links

Abstract / PDF

Why It Matters For Business

Cuts repeated search calls and produces clarifications you can cite, improving enterprise search accuracy and user trust while lowering retrieval cost.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces VERDICT, a pipeline that integrates retrieval feedback and LLM execution feedback to generate disambiguated, verifiable sub-questions for vague queries. Instead of first making multiple interpretations and then verifying them, VERDICT retrieves a high-recall set of passages once, asks the LLM to propose only interpretations that each passage can answer, and clusters results to remove noise. On the ASQA benchmark (Wikipedia) VERDICT raises grounded F1 by ~23% on average versus strong Diversify-then-Verify baselines and reduces redundant retrieval and long-context LLM calls.
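The retrieve-once, propose-per-passage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `verified_diversification`, `toy_retrieve`, and `toy_propose` are hypothetical names, and the retriever and LLM are replaced by toy stand-ins so the block runs on its own. The clustering step is omitted here.

```python
from typing import Callable, List, Optional, Tuple

Passage = str
QAPair = Tuple[str, str]  # (disambiguated question, answer)


def verified_diversification(
    query: str,
    retrieve: Callable[[str, int], List[Passage]],
    propose: Callable[[str, Passage], Optional[QAPair]],
    k: int = 20,
) -> List[Tuple[Passage, QAPair]]:
    """Retrieve once, then keep only interpretations a passage can answer."""
    passages = retrieve(query, k)  # the single retrieval call for this query
    grounded = []
    for passage in passages:
        qa = propose(query, passage)  # one short-context LLM call per passage
        if qa is not None:  # None: the passage answers no interpretation
            grounded.append((passage, qa))
    return grounded


# Toy stand-ins (assumptions) so the sketch runs without a retriever or LLM.
CORPUS = [
    "The 2019 film Joker was directed by Todd Phillips.",
    "Heath Ledger played the Joker in The Dark Knight (2008).",
    "Bananas are rich in potassium.",
]


def toy_retrieve(query: str, k: int) -> List[Passage]:
    return CORPUS[:k]


def toy_propose(query: str, passage: Passage) -> Optional[QAPair]:
    # A real system would prompt an LLM here; this stub only checks a keyword.
    if "Joker" not in passage:
        return None
    return (f"{query} [grounded in: {passage}]", passage)


result = verified_diversification("Who is behind the Joker?", toy_retrieve, toy_propose)
for passage, (question, answer) in result:
    print(question)
```

Each surviving pair carries the passage that supports it, which is what makes the resulting clarifications citable.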

Problem Statement

Ambiguous short queries cause RAG systems to generate many ungrounded interpretations. The common Diversify-then-Verify (DtV) pipeline first creates interpretations from the LLM and then retrieves for each, causing wasted retrieval, noisy passages, cascading errors, and high verification cost. The paper aims to produce diverse interpretations that are grounded (verifiable) while cutting retrieval and verification overhead.

Main Contribution

VERDICT: a unified 'verified diversification' pipeline that conditions interpretation generation on a single retrieved universe and generator execution feedback.

Consolidation: cluster-based aggregation of per-passage interpretation-answer pairs to denoise and deduplicate without extra retrieval.

Key Findings

VERDICT yields large average grounded-F1 gains vs strong baselines.

Numbers: avg +23% G-F1 across backbone LLMs

Practical Use: Use VERDICT to raise verifiable disambiguation quality when you care about grounded answers.

Evidence Ref: Abstract; Sec. 1; Sec. 5

VERDICT needs only one retriever call per query instead of one per interpretation.

Numbers: retriever calls: O(1) vs O(|interpretations|); k tuned to 20

Practical Use: Expect lower retrieval cost and fewer search queries; scale retrieval once per user query.

Evidence Ref: Sec. 4.1; Table 1
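To make the O(1) versus O(|interpretations|) difference concrete, the sketch below counts retriever calls under both strategies. `CountingRetriever`, `dtv_retrieval`, and `single_pass_retrieval` are hypothetical helpers written for this illustration, and the retriever returns placeholder strings rather than real passages.

```python
class CountingRetriever:
    """Stub retriever that counts how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, query, k=20):
        self.calls += 1
        return [f"passage {i} for {query!r}" for i in range(k)]


def dtv_retrieval(interpretations, retriever):
    # Diversify-then-Verify style: one retrieval per proposed interpretation.
    return [retriever(interp) for interp in interpretations]


def single_pass_retrieval(query, retriever):
    # Single-pass style: one retrieval for the original query.
    return retriever(query)


interpretations = [f"interpretation {i}" for i in range(5)]

dtv = CountingRetriever()
dtv_retrieval(interpretations, dtv)

single = CountingRetriever()
single_pass_retrieval("original query", single)

print(dtv.calls, single.calls)  # 5 1
```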

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
G-F1 (grounded F1) on ASQA	70.82 (VERDICT, GPT-4o)	41.79 (DtV, GPT-4o)	+29.03 pts	ASQA dev	Table 2 (GPT-4o rows)	Table 2
G-F1 (grounded F1) on ASQA	67.80 (VERDICT, LLaMA 70B)	48.00 (DtV, LLaMA 70B)	+19.80 pts	ASQA dev	Table 2 (LLaMA 70B rows)	Table 2

What To Try In 7 Days

Implement a single high-recall retrieval pass (top-k, k≈20) and run LLM per passage to extract question-answer pairs.

Cluster the per-passage Q&A embeddings (HDBSCAN) and pick medoids to deduplicate clarifications.

Tune clustering to reach a business-preferred precision/recall trade-off (conservative vs default).
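The deduplication step above can be sketched as medoid selection over cluster labels. This assumes the labels were already produced by a clustering run such as HDBSCAN, with `-1` marking noise points (the usual HDBSCAN convention); the embeddings and labels below are made-up toy values, and `pick_medoids` is a hypothetical helper, not code from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pick_medoids(
    embeddings: List[Sequence[float]], labels: List[int]
) -> Dict[int, int]:
    """Return {cluster label: index of its medoid}, skipping noise (-1)."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)
    medoids = {}
    for label, idxs in members.items():
        # The medoid minimizes total distance to the other cluster members.
        medoids[label] = min(
            idxs,
            key=lambda i: sum(
                cosine_distance(embeddings[i], embeddings[j]) for j in idxs
            ),
        )
    return medoids


# Toy Q&A embeddings: two clusters of three points plus one noise point.
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
    [0.5, -0.5],
]
labels = [0, 0, 0, 1, 1, 1, -1]

medoids = pick_medoids(embeddings, labels)
print(medoids)  # {0: 1, 1: 4}
```

Keeping one medoid per cluster and dropping noise points is one way to trade recall for precision; a more conservative setting would leave more points as noise and emit fewer clarifications.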

Agent Features

Tool Use

retriever, LLM generator, embedding encoder

Frameworks

VERDICT

Is Agentic

Yes

Architectures

single-agent pipeline

Optimization Features

Token Efficiency

avoid long multi-passage LLM inputs

Infra Optimization

reduced retriever queries; fewer large-model long-context runs

System Optimization

reuse the retrieval embedding encoder for clustering

Inference Optimization

single retrieval per query; per-passage LLM calls to minimize context length; parallelizable passage-level LLM inferences
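Because the passage-level calls are independent, they can be dispatched concurrently. The sketch below uses Python's standard `ThreadPoolExecutor`; `run_per_passage` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for whatever LLM client is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_per_passage(
    query: str,
    passages: List[str],
    call_llm: Callable[[str, str], Optional[str]],
    max_workers: int = 8,
) -> List[Optional[str]]:
    """Run one short-context call per passage in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input passages.
        return list(pool.map(lambda p: call_llm(query, p), passages))


def fake_llm(query: str, passage: str) -> Optional[str]:
    # Assumption: None signals that the passage cannot answer the query.
    return None if "unrelated" in passage else f"answer from: {passage}"


outputs = run_per_passage(
    "example query", ["passage A", "unrelated text", "passage B"], fake_llm
)
print(outputs)
```

Threads suit this case when each call is an I/O-bound request to a hosted model; a locally served model would more likely rely on batching instead.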

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Method assumes a reasonably good retriever; poor retrieval coverage breaks verified diversification.

Not evaluated under adversarial or extremely degraded retriever/generator behaviors.

When Not To Use

If your retriever cannot return relevant passages in top-k for typical queries.

When you cannot afford many per-passage LLM calls (compute-constrained environments).
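One way to check the first condition above before a pilot is to measure how often the retriever places at least one relevant passage in the top k on a small labeled sample. The helper below is a generic hit-rate calculation, not a metric defined in the paper; `hit_rate_at_k` and the passage IDs are made up for illustration.

```python
from typing import List, Set


def hit_rate_at_k(
    ranked_ids_per_query: List[List[str]],
    relevant_ids_per_query: List[Set[str]],
    k: int = 20,
) -> float:
    """Fraction of queries with at least one relevant passage in the top k."""
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        if set(ranked[:k]) & set(relevant):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Toy labeled sample: ranked retriever output and known-relevant passage IDs.
ranked = [["p1", "p7", "p3"], ["p9", "p2", "p4"], ["p5", "p6", "p8"]]
relevant = [{"p3"}, {"p4"}, {"p0"}]

print(hit_rate_at_k(ranked, relevant, k=2))
print(hit_rate_at_k(ranked, relevant, k=3))
```

A low hit rate at the chosen k suggests fixing retrieval first, since interpretations whose supporting passage is never retrieved cannot be recovered downstream.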

Failure Modes

Missing interpretation if correct passage is outside top-k retrieval.

Smaller LLMs struggle to detect unanswerable passages (higher interpretation error rate for 8B models).

Core Entities

Models

LLaMA 3.1 8B, LLaMA 3.3 70B, GPT-4o, gte-Qwen2-7B-instruct, arctic-embed

Metrics

G-Precision, G-Recall, G-F1, Ungrounded Recall, Average # interpretations (|Q|)

Datasets

ASQA, AmbigNQ, Wikipedia (corpus used in ASQA)

Benchmarks

ASQA
