Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
10
Why It Matters For Business
If you build customer-facing assistants, ALCE gives you a reproducible way to measure whether answers are supported by their cited sources, which helps reduce the user mistrust that hallucinations cause.
Summary TLDR
The paper introduces ALCE, a reproducible benchmark for evaluating LLMs that must generate long-form answers with explicit citations. ALCE includes three datasets (ASQA, QAMPARI, ELI5), a retrieval setup over 100-word passages, and automatic metrics for fluency (MAUVE), correctness (dataset-specific measures), and citation quality (citation recall and precision, judged by an NLI model called TRUE). The authors test many prompting strategies and models (ChatGPT, GPT-4, LLaMA variants) and show that simple "VANILLA" prompting with top-k retrieved passages is a strong baseline. Major gaps remain in retrieval quality, context-window limits, and multi-document synthesis. Code and data are public.
Problem Statement
LLMs produce fluent answers but often hallucinate and lack verifiable sources. Prior systems use closed commercial search and human-only evaluation, which makes comparisons and reproduction hard. The paper builds a reproducible benchmark and automatic metrics to measure whether LLMs' generated statements are supported by cited passages.
Main Contribution
ALCE benchmark: three datasets (ASQA, QAMPARI, ELI5) with retrieval corpora and a 100-word passage format.
Automatic evaluation suite measuring fluency, correctness, and citation quality, validated by human judgments.
A systematic study of retrieval and prompting strategies (VANILLA, SUMM, SNIPPET, INTERACT, INLINESEARCH, RERANK, POSTCITE) across closed and open models.
Empirical findings and failure analyses highlighting retrieval, context-window, and multi-document synthesis as key bottlenecks.
Public release of code and data for reproducible comparisons (GitHub).
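The 100-word passage format can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and uses a hypothetical helper name, `split_into_passages`; the benchmark's released preprocessing may tokenize text or attach titles differently.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping passages of at most N words.

    Assumption: words are whitespace-separated tokens.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(250))
    passages = split_into_passages(doc)
    print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Fixed-size passages keep retrieval units comparable across corpora, at the cost of sometimes cutting a sentence in half.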
Key Findings
Even the best-performing models still fail to fully support their answers with cited passages on open-ended questions.
A simple prompt that places top-k retrieved passages in context (VANILLA) provides a strong baseline.
Replacing full passages with summaries or extracted snippets (SUMM, SNIPPET) can increase factual coverage but reduces citation faithfulness.
Reranking multiple generated answers by automated citation recall noticeably improves citation quality.
Closed-book generation plus post-hoc matching (POSTCITE) can yield answers that look correct, but the citations attached afterward often do not support them.
Retrieval quality sets an upper bound on performance, and current retrievers leave substantial room for improvement.
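The VANILLA baseline mentioned in these findings can be sketched as a prompt builder that places the top-k retrieved passages in context. The instruction wording, the `Document [i]` layout, and the `title`/`text` field names are illustrative assumptions, not the paper's exact prompt.

```python
from typing import Dict, List


def build_vanilla_prompt(
    question: str, passages: List[Dict[str, str]], k: int = 5
) -> str:
    """Build a prompt that shows the top-k passages and asks for cited answers.

    Assumption: `passages` is already sorted by retrieval score, best first.
    """
    instruction = (
        "Write an accurate answer to the question using only the documents "
        "below. Cite documents inline as [1], [2], ... after each sentence "
        "they support."
    )
    lines = [instruction, ""]
    for idx, passage in enumerate(passages[:k], start=1):
        lines.append(
            f"Document [{idx}] (Title: {passage['title']}): {passage['text']}"
        )
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

Numbering the documents in the prompt gives the model stable identifiers to cite, so the citations in the output can be mapped back to passages for evaluation.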
Results
ChatGPT, VANILLA prompting (5 passages), metrics reported per dataset:
ASQA: Fluency (MAUVE), Correctness (EM recall), Citation Recall
QAMPARI: Correctness (Rec.-5)
ELI5: Correctness (Claim recall), Citation Recall
Who Should Care
What To Try In 7 Days
Run VANILLA: feed top-k retrieved passages to your LLM and measure citation recall with an NLI model.
Generate 4 candidate answers and apply RERANK: keep the candidate with the highest automated citation recall.
Swap your retriever for a stronger dense retriever (e.g., GTR) and compare retrieval recall@k against downstream correctness.
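Two of the steps above, reranking sampled answers and checking retrieval recall@k, can be sketched in a few lines. The function names are hypothetical, the `score` callable stands in for whatever citation-recall scorer you use, and gold passage ids are an assumption (an evaluation may instead check whether gold answers appear in the retrieved text).

```python
from typing import Callable, Iterable, Sequence


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate answer with the highest score (first wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def recall_at_k(
    retrieved_ids: Sequence[str], gold_ids: Iterable[str], k: int
) -> float:
    """Fraction of gold passage ids found among the top-k retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids[:k])) / len(gold)
```

Tracking recall@k next to downstream correctness helps separate retriever failures from generation failures: if recall@k rises while correctness stays flat, the bottleneck is more likely in how the model uses the passages.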
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- MAUVE can be unstable and sensitive to output length.
- ELI5 claim-generation may not capture all valid answer aspects for open questions.
- Citation checks rely on an NLI model that cannot detect 'partial support' reliably.
- Datasets do not cover multi-hop reasoning, math, or code-heavy tasks.
- Paper focuses on prompting; training models to natively generate citations is left to future work.
When Not To Use
- When you require citations to whole documents or web pages rather than 100-word passages.
- When your task requires multi-hop reasoning or precise math verification not represented in ALCE.
- When you need guaranteed detection of partial support by citations (the NLI model has blind spots).
Failure Modes
- Retriever fails to return supporting passages, limiting downstream citation recall.
- LLM gets distracted by irrelevant passages and omits key facts.
- Post-hoc citation matching attaches citations that don't truly support generated text.
- NLI classifier labels partially supporting citations as unsupported, producing false negatives.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo)
- ChatGPT-16K
- GPT-4
- LLaMA
- Vicuna
- Alpaca
- LLaMA-2-Chat
- Fusion-in-Decoder (FiD)
Metrics
- MAUVE (fluency)
- Exact-Match Recall (ASQA)
- Precision/Recall and Recall-5 (QAMPARI)
- Claim Recall (ELI5 via generated sub-claims)
- Citation Recall (NLI-based)
- Citation Precision (NLI-based)
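The two NLI-based citation metrics can be sketched as follows. This is one reading of the definitions: a statement counts toward recall when the concatenation of its cited passages entails it, and a citation counts toward precision when its statement is supported and the citation is not irrelevant (it is irrelevant if it does not support the statement alone and the remaining citations already suffice). The `entails` callable is a hypothetical stand-in for an NLI model such as TRUE, and the released evaluation code may differ in detail.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical NLI interface: entails(premise, hypothesis) -> bool.
Entails = Callable[[str, str], bool]
Statement = Tuple[str, List[int]]  # (statement text, ids of cited passages)


def citation_scores(
    statements: Sequence[Statement],
    passages: Dict[int, str],
    entails: Entails,
) -> Tuple[float, float]:
    """Return (citation_recall, citation_precision).

    Assumption: every cited id is a key of `passages`.
    """
    supported_statements = 0
    relevant_citations = 0
    total_citations = 0

    for text, cited in statements:
        total_citations += len(cited)
        joint = " ".join(passages[i] for i in cited)
        # Recall: at least one citation, and the cited passages jointly entail.
        supported = bool(cited) and entails(joint, text)
        if not supported:
            continue
        supported_statements += 1
        for i in cited:
            supports_alone = entails(passages[i], text)
            others = [j for j in cited if j != i]
            others_suffice = bool(others) and entails(
                " ".join(passages[j] for j in others), text
            )
            # Irrelevant: adds nothing alone and the rest already suffices.
            if supports_alone or not others_suffice:
                relevant_citations += 1

    recall = supported_statements / len(statements) if statements else 0.0
    precision = relevant_citations / total_citations if total_citations else 0.0
    return recall, precision
```

Recall is averaged over statements and precision over citations, so an answer with no citations at all scores zero on both.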
Datasets
- ASQA
- QAMPARI
- ELI5
- Wikipedia (2018-12-20 snapshot)
- Sphere (filtered Common Crawl)
Benchmarks
- ALCE

