Overview
The benchmark and its automatic metrics are validated against human evaluation and tested across a broad set of models and prompting strategies. The results are robust, but they are bounded by the accuracy of the NLI model and by dataset coverage; scores should improve as retrievers and long-context models get better.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you build customer-facing assistants, ALCE gives a reproducible way to measure whether answers are supported by sources and helps reduce user mistrust from hallucinations.
Summary TLDR
The paper introduces ALCE, a reproducible benchmark for evaluating LLMs that must generate long answers with explicit citations. ALCE includes three datasets (ASQA, QAMPARI, ELI5), a retrieval setup over 100-word passages, and automatic metrics for fluency (MAUVE), correctness (dataset-specific measures), and citation quality (judged by an NLI model called TRUE). The authors test many prompting strategies and models (ChatGPT, GPT-4, LLaMA variants) and show that simple "VANILLA" prompting with top-k retrieved passages is a strong baseline. Major gaps remain in retrieval quality, context-window limits, and multi-document synthesis. Code and data are public.
Problem Statement
LLMs produce fluent answers but often hallucinate and lack verifiable sources. Prior systems use closed commercial search and human-only evaluation, which makes comparisons and reproduction hard. The paper builds a reproducible benchmark and automatic metrics to measure whether LLMs' generated statements are supported by cited passages.
Main Contribution
ALCE benchmark: three datasets (ASQA, QAMPARI, ELI5) with retrieval corpora and a 100-word passage format.
Automatic evaluation suite measuring fluency, correctness, and citation quality, validated by human judgments.
Key Findings
Even the best-performing models still fail to fully support their answers with cited passages on open-ended questions.
A simple prompt that places top-k retrieved passages in context (VANILLA) provides a strong baseline.
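The VANILLA setup described above can be sketched as a small prompt builder. This is a minimal illustration, not the paper's actual prompt: the function name `build_vanilla_prompt`, the instruction wording, and the `Document [i]` layout are all assumptions made for the example.

```python
from typing import List


def build_vanilla_prompt(question: str, passages: List[str], k: int = 5) -> str:
    """Place the top-k retrieved passages in the prompt, each with a numbered id.

    The instruction text and layout are illustrative assumptions,
    not the wording used by the ALCE authors.
    """
    lines = [
        "Answer the question using only the documents below.",
        "Cite supporting documents inline as [1], [2], and so on.",
        "",
    ]
    # Passages are assumed to arrive sorted by retrieval score, best first.
    for i, passage in enumerate(passages[:k], start=1):
        lines.append(f"Document [{i}]: {passage}")
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_vanilla_prompt(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany.", "Rome is in Italy."],
    k=2,
)
```

Numbering the passages in the prompt is what lets the generated `[i]` markers be mapped back to specific passages when citation quality is scored.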
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASQA ChatGPT VANILLA (5 passages) - Fluency (MAUVE) | 66.6 | — | — | ASQA | Table 4 (MAUVE) | Table 4 |
| ASQA ChatGPT VANILLA (5 passages) - Correctness (EM recall) | 40.4 | — | — | ASQA | Table 4 (EM recall) | Table 4 |
What To Try In 7 Days
Run VANILLA: feed top-k retrieved passages to your LLM and measure citation recall with an NLI model.
Generate 4 answers and apply RERANK by citation recall to improve verifiable output.
Swap in a stronger dense retriever (e.g., GTR) and compare retrieval R@k against downstream correctness.
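The first two steps above can be prototyped without committing to a particular NLI model. The sketch below assumes a simplified definition of citation recall (a sentence counts as supported when the concatenation of the passages it cites entails it) and a hypothetical `entails(premise, hypothesis)` callable that you would back with a real NLI model such as TRUE. The naive sentence splitter and the `toy_entails` stand-in exist only to make the example runnable.

```python
import re
from typing import Callable, List

# (premise, hypothesis) -> True when the premise entails the hypothesis.
# Hypothetical interface: wrap whatever NLI model you use behind it.
Entails = Callable[[str, str], bool]

CITATION = re.compile(r"\[(\d+)\]")


def citation_recall(answer: str, passages: List[str], entails: Entails) -> float:
    """Fraction of answer sentences entailed by the passages they cite."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        ids = [int(m) for m in CITATION.findall(sentence)]
        # Ignore citation ids that do not point at a provided passage.
        cited = [passages[i - 1] for i in ids if 1 <= i <= len(passages)]
        claim = re.sub(r"\s*\[\d+\]", "", sentence).strip()
        # A sentence with no valid citation counts as unsupported.
        if cited and entails(" ".join(cited), claim):
            supported += 1
    return supported / len(sentences)


def rerank_by_citation_recall(candidates: List[str], passages: List[str], entails: Entails) -> str:
    """Return the candidate answer with the highest citation recall."""
    return max(candidates, key=lambda c: citation_recall(c, passages, entails))


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for a real NLI model: case-insensitive substring match."""
    return hypothesis.rstrip(".!?").lower() in premise.lower()


passages = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
good = "Paris is the capital of France [1]. Berlin is the capital of Germany [2]."
bad = "Paris is the capital of France [2]. Berlin is the capital of Germany."
best = rerank_by_citation_recall([bad, good], passages, toy_entails)
```

In practice, `candidates` would hold the 4 sampled answers, and `toy_entails` would be replaced by a call to the NLI model.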
Risks & Boundaries
Limitations
MAUVE can be unstable and sensitive to output length.
ELI5 claim-generation may not capture all valid answer aspects for open questions.
When Not To Use
When you require page-level or web-page citations instead of 100-word passages.
When your task requires multi-hop reasoning or precise math verification not represented in ALCE.
Failure Modes
Retriever fails to return supporting passages, limiting downstream citation recall.
LLM gets distracted by irrelevant passages and omits key facts.
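To tell the first failure mode apart from the second, it helps to check whether the supporting text was retrieved at all. The sketch below assumes a simple string-match notion of retrieval recall: the share of gold short answers that appear in at least one of the top-k passages. The function name `recall_at_k` and this matching rule are assumptions for illustration; the paper's exact retrieval metric may differ.

```python
from typing import List


def recall_at_k(passages: List[str], gold_answers: List[str], k: int) -> float:
    """Share of gold answers found verbatim in at least one top-k passage."""
    if not gold_answers:
        return 0.0
    # Passages are assumed to be sorted by retrieval score, best first.
    top = [p.lower() for p in passages[:k]]
    hits = sum(1 for answer in gold_answers if any(answer.lower() in p for p in top))
    return hits / len(gold_answers)


retrieved = [
    "The Eiffel Tower is in Paris.",
    "Berlin hosts the Reichstag.",
    "Rome has the Colosseum.",
]
gold = ["Paris", "Rome"]
r_at_1 = recall_at_k(retrieved, gold, k=1)
r_at_3 = recall_at_k(retrieved, gold, k=3)
```

If recall at k is low, the retriever is the bottleneck. If it is high but citation recall stays low, the model is more likely being distracted by irrelevant passages.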

