VERDICT: unify diversification and verification to produce grounded clarifications in RAG

February 14, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He


Why It Matters For Business

Cuts repeated search calls and produces clarifications you can cite, improving enterprise search accuracy and user trust while lowering retrieval cost.

Summary TLDR

The paper introduces VERDICT, a pipeline that combines retrieval feedback and LLM execution feedback to generate disambiguated, verifiable sub-questions for vague queries. Instead of first generating multiple interpretations and then verifying them, VERDICT retrieves a high-recall set of passages once, asks the LLM, passage by passage, to propose only interpretations that the passage can answer, and clusters the results to remove noise. On the ASQA benchmark (Wikipedia), VERDICT raises grounded F1 by ~23% on average versus strong Diversify-then-Verify baselines and reduces redundant retrieval and long-context LLM calls.
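The retrieve-once, read-per-passage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `verified_diversification`, `toy_retrieve`, `toy_read_passage`, and the tiny corpus are hypothetical stand-ins for a real retriever and a real LLM prompt.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases for this sketch; these names are not from the paper.
Retriever = Callable[[str, int], List[str]]
PassageReader = Callable[[str, str], Optional[Tuple[str, str]]]


def verified_diversification(
    query: str, retrieve: Retriever, read_passage: PassageReader, k: int = 20
) -> List[Tuple[str, str]]:
    """Retrieve once, then keep only interpretations a passage can answer."""
    passages = retrieve(query, k)  # the single retrieval call for this query
    pairs = []
    for passage in passages:
        result = read_passage(query, passage)  # one short-context call per passage
        if result is not None:  # None means "this passage answers no reading"
            pairs.append(result)
    return pairs


# Toy stand-ins so the sketch runs without a search index or an LLM.
CORPUS = [
    "Mercury (planet): orbits the Sun every 88 days.",
    "Mercury (element): has atomic number 80.",
    "Mercury appears in many trivia quizzes.",
    "Venus (planet): is the second planet from the Sun.",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    # Keyword match in place of a dense retriever.
    return [p for p in CORPUS if query.lower() in p.lower()][:k]


def toy_read_passage(query: str, passage: str) -> Optional[Tuple[str, str]]:
    # A passage "answers" the query only if it names a specific sense of it.
    prefix = f"{query} ("
    if not passage.startswith(prefix) or "): " not in passage:
        return None
    sense = passage[len(prefix):passage.index(")")]
    answer = passage.split("): ", 1)[1]
    return (f"{query} ({sense})", answer)


pairs = verified_diversification("Mercury", toy_retrieve, toy_read_passage)
for interpretation, answer in pairs:
    print(interpretation, "=>", answer)
```

In this toy run the trivia passage is retrieved but contributes no interpretation, which is the grounding filter in miniature: every surviving interpretation comes paired with the passage text that answers it.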

Problem Statement

Ambiguous short queries cause RAG systems to generate many ungrounded interpretations. The common Diversify-then-Verify (DtV) pipeline first has the LLM generate interpretations and then runs a separate retrieval for each one, which causes wasted retrieval, noisy passages, cascading errors, and high verification cost. The paper aims to produce diverse interpretations that are grounded (verifiable) while cutting retrieval and verification overhead.

Main Contribution

VERDICT: a unified 'verified diversification' pipeline that conditions interpretation generation on a single retrieved universe and generator execution feedback.

Consolidation: cluster-based aggregation of per-passage interpretation-answer pairs to denoise and deduplicate without extra retrieval.

Empirical gains on ASQA: improves grounded F1 (verifiability) substantially and reduces retrieval/LLM costs compared to DtV baselines; code and evaluation framework released.

Key Findings

VERDICT yields large average grounded-F1 gains vs strong baselines.

Numbers: avg +23% G-F1 across backbone LLMs

VERDICT needs only one retriever call per query instead of one per interpretation.

Numbers: retriever calls O(1) vs O(|interpretations|); k tuned to 20
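To make the O(1) versus O(|interpretations|) difference concrete, the sketch below counts calls against a stub retriever. `CountingRetriever`, `dtv_retrieval`, and `single_pass_retrieval` are hypothetical names introduced for illustration; only the call pattern reflects the summary above.

```python
from typing import List


class CountingRetriever:
    """Stub retriever that records how many times it is queried."""

    def __init__(self) -> None:
        self.calls = 0

    def search(self, query: str, k: int) -> List[str]:
        self.calls += 1
        return [f"passage {i} for {query!r}" for i in range(k)]


def dtv_retrieval(retriever: CountingRetriever, interpretations: List[str], k: int) -> None:
    # Diversify-then-Verify pattern: one retrieval per generated interpretation.
    for interpretation in interpretations:
        retriever.search(interpretation, k)


def single_pass_retrieval(retriever: CountingRetriever, query: str, k: int) -> List[str]:
    # Single-pass pattern: one high-recall retrieval for the original query.
    return retriever.search(query, k)


interpretations = [
    "Mercury the planet",
    "Mercury the element",
    "Mercury the Roman god",
    "Mercury the record label",
]

dtv = CountingRetriever()
dtv_retrieval(dtv, interpretations, k=20)

single = CountingRetriever()
passages = single_pass_retrieval(single, "Mercury", k=20)

print("DtV retriever calls:", dtv.calls)
print("Single-pass retriever calls:", single.calls)
```

The per-interpretation count grows with however many readings the LLM proposes, while the single-pass count stays at one regardless of how ambiguous the query is.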

VERDICT produces higher grounded precision and recall on ASQA.

Numbers: GPT-4o G-F1 70.82 (VERDICT) vs 41.79 (DtV)

VERDICT matches or exceeds human-level diversity while staying grounded.

Numbers: avg interpretations ≈ 3.7 (VERDICT) vs 3.36 (human) on ASQA

Clustering (consolidation) controls precision/recall trade-off.

Numbers: Default |Q|=3.7 (G-Precision 81.5, G-Recall 58.0); Conservative |Q|=2.41 (G-Precision 82.4, G-Recall 50.7)
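The trade-off between the two clustering settings is easier to compare as a single number. The sketch below assumes G-F1 is the harmonic mean of the aggregate G-Precision and G-Recall; the paper may instead average F1 per query, in which case reported values would differ slightly from these.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same 0-100 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs taken from the summary above.
default_f1 = f1(81.5, 58.0)
conservative_f1 = f1(82.4, 50.7)

print(f"Default:      {default_f1:.2f}")       # about 67.77
print(f"Conservative: {conservative_f1:.2f}")  # about 62.78
```

Under that assumption, the conservative setting gains less than one point of precision and gives up about seven points of recall, so it lowers the combined score by roughly five points.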

Results

G-F1 (grounded F1) on ASQA

Value: 70.82 (VERDICT, GPT-4o)

Baseline: 41.79 (DtV, GPT-4o)

G-F1 (grounded F1) on ASQA

Value: 67.80 (VERDICT, LLaMA 70B)

Baseline: 48.00 (DtV, LLaMA 70B)

Average # interpretations per query

Value: ≈3.7 (VERDICT)

Baseline: ≈1.36–3.78 (DtV, varies by model)

What To Try In 7 Days

Implement a single high-recall retrieval pass (top-k, k≈20) and run the LLM once per passage to extract question-answer pairs.

Cluster the per-passage Q&A embeddings (HDBSCAN) and pick medoids to deduplicate clarifications.

Tune clustering to reach a business-preferred precision/recall trade-off (conservative vs default).
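The clustering and medoid steps above can be prototyped before wiring in a real encoder. The sketch below is a dependency-free stand-in: it swaps HDBSCAN for simple single-link clustering over a cosine-similarity threshold, and the 2-D vectors are made-up toy embeddings. `cluster_by_threshold`, `medoid`, and the `threshold` parameter are hypothetical names for illustration, not part of VERDICT.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_threshold(vectors: List[List[float]], threshold: float) -> List[List[int]]:
    """Single-link clustering: join any two items with similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def medoid(cluster: List[int], vectors: List[List[float]]) -> int:
    """Index of the member most similar, in total, to the rest of its cluster."""
    return max(cluster, key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in cluster))


questions = [
    "When did the planet Mercury form?",
    "How old is the planet Mercury?",
    "What is the age of Mercury, the planet?",
    "When was the element mercury discovered?",
]
# Toy 2-D embeddings; a real system would use an embedding encoder.
embeddings = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [0.0, 1.0]]

clusters = cluster_by_threshold(embeddings, threshold=0.9)
representatives = [questions[medoid(c, embeddings)] for c in clusters]
print(representatives)

# A stricter threshold merges less and so keeps more clarifications.
strict_clusters = cluster_by_threshold(embeddings, threshold=0.99)
print(len(strict_clusters))
```

In this stand-in the threshold plays the role of the tuning knob: lowering it merges more questions into fewer clarifications (the conservative direction), raising it keeps more of them. With HDBSCAN the corresponding knobs are its own parameters, such as the minimum cluster size.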

Agent Features

Tool Use

  • retriever
  • LLM generator
  • embedding encoder

Frameworks

  • VERDICT

Is Agentic

true

Architectures

  • single-agent pipeline

Optimization Features

Token Efficiency

  • avoid long multi-passage LLM inputs

Infra Optimization

  • reduced retriever queries; fewer large-model long-context runs

System Optimization

  • reuse the retrieval embedding encoder for clustering

Inference Optimization

  • single retrieval per query
  • per-passage LLM calls to minimize context length
  • parallelizable passage-level LLM inferences
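Because each passage is processed independently, the per-passage calls can be issued concurrently. A minimal sketch using Python's standard `concurrent.futures` follows; `answer_from_passage` is a hypothetical placeholder for a real LLM client call, and `max_workers=4` is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def answer_from_passage(query: str, passage: str) -> str:
    # Hypothetical placeholder for one short-context LLM call on a single passage.
    return f"[{query}] grounded in: {passage}"


def run_per_passage(query: str, passages: List[str], max_workers: int = 4) -> List[str]:
    """Issue one call per passage in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: answer_from_passage(query, p), passages))


passages = [f"passage {i}" for i in range(20)]
results = run_per_passage("Mercury", passages)
print(len(results))
print(results[0])
```

Threads suit this pattern when each call mostly waits on a network response; `pool.map` returns results in passage order, so downstream clustering can still map each output back to its source passage.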

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method assumes a reasonably good retriever; poor retrieval coverage breaks verified diversification.
  • Not evaluated under adversarial or extremely degraded retriever/generator behaviors.
  • Main benchmark is ASQA (Wikipedia); enterprise corpora may need retriever tuning and prompt adjustments.

When Not To Use

  • If your retriever cannot return relevant passages in top-k for typical queries.
  • When you cannot afford many per-passage LLM calls (compute-constrained environments).
  • If adversarial or highly noisy corpora make consolidation unreliable.

Failure Modes

  • Missing interpretation if correct passage is outside top-k retrieval.
  • Smaller LLMs struggle to detect unanswerable passages (higher interpretation error rate for 8B models).
  • Clustering may merge distinct intents if embeddings fail to separate them.

Core Entities

Models

  • LLaMA 3.1 8B
  • LLaMA 3.3 70B
  • GPT-4o
  • gte-Qwen2-7B-instruct
  • arctic-embed

Metrics

  • G-Precision
  • G-Recall
  • G-F1
  • Ungrounded Recall
  • Average # interpretations (|Q|)

Datasets

  • ASQA
  • AmbigNQ
  • Wikipedia (corpus used in ASQA)

Benchmarks

  • ASQA