ALCE: a reproducible benchmark and metrics to make LLM answers cite their sources

May 24, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

10

Authors

Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen

Links

Abstract / PDF

Why It Matters For Business

If you build customer-facing assistants, ALCE gives a reproducible way to measure whether answers are supported by sources and helps reduce user mistrust from hallucinations.

Summary TLDR

The paper introduces ALCE, a reproducible benchmark to evaluate LLMs that must generate long answers with explicit citations. ALCE includes three datasets (ASQA, QAMPARI, ELI5), a retrieval setup over 100-word passages, and automatic metrics for fluency (MAUVE), correctness (dataset-specific measures), and citation quality (an NLI model called TRUE). The authors test many prompting strategies and models (ChatGPT, GPT-4, LLaMA variants) and show simple "VANILLA" prompting with top-k retrieved passages is a strong baseline, but major gaps remain: retrieval quality, context-window limits, and multi-document synthesis. Code and data are public.

Problem Statement

LLMs produce fluent answers but often hallucinate and lack verifiable sources. Prior systems use closed commercial search and human-only evaluation, which makes comparisons and reproduction hard. The paper builds a reproducible benchmark and automatic metrics to measure whether LLMs' generated statements are supported by cited passages.

Main Contribution

ALCE benchmark: three datasets (ASQA, QAMPARI, ELI5) with retrieval corpora and a 100-word passage format.

Automatic evaluation suite measuring fluency, correctness, and citation quality, validated by human judgments.

A systematic study of retrieval and prompting strategies (VANILLA, SUMM, SNIPPET, INTERACT, INLINESEARCH, RERANK, POSTCITE) across closed and open models.

Empirical findings and failure analyses highlighting retrieval, context-window, and multi-document synthesis as key bottlenecks.

Public release of code and data for reproducible comparisons (GitHub).
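The 100-word passage format mentioned above can be illustrated with a small chunker. This is a minimal sketch, assuming plain whitespace tokenization; the function name `split_into_passages` is hypothetical and the ALCE corpora may have been segmented differently.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive passages of at most `words_per_passage` words.

    Assumption: words are whitespace-separated tokens. This is an
    illustration of the passage format, not the benchmark's own tooling.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]
```

The last passage is simply shorter when the document length is not a multiple of the passage size.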

Key Findings

Even the best-performing models still fail to fully support their answers with cited passages on open-ended questions.

Numbers: ≈50% of generations lack full citation support on ELI5 (ChatGPT/GPT-4)

A simple prompt that places top-k retrieved passages in context (VANILLA) provides a strong baseline.

Numbers: ChatGPT VANILLA (5-psg) citation recall 73.6% on ASQA

Summaries or extracted snippets can increase factual coverage but reduce citation faithfulness.

Numbers: ASQA: Correctness +2.9 pts, Citation Rec. −4.7 pts (SUMM 10-psg vs VANILLA 5-psg)

Reranking multiple generated answers by automated citation recall noticeably improves citation quality.

Numbers: ASQA: citation recall +11.2 pts with RERANK (73.6 → 84.8)
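The reranking idea can be sketched in a few lines: sample several answers, score each with an automatic citation-recall estimate, and keep the highest-scoring one. The function name and the `score` callable are hypothetical placeholders; in practice `score` would wrap an NLI-based citation-recall check.

```python
from typing import Callable, Sequence


def rerank_by_citation_recall(
    candidates: Sequence[str],
    score: Callable[[str], float],
) -> str:
    """Return the candidate answer with the highest citation-recall score.

    `score` is an assumed callable mapping an answer to a value in [0, 1].
    Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=score)
```

The selection step itself is cheap; the cost lies in generating the extra candidates and in running the citation check on each of them.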

Closed-book generation plus post-hoc matching (POSTCITE) can yield good-looking correct answers but poor real citation quality.

Numbers: ASQA: CLOSEDBOOK+POSTCITE citation recall worse by ~47% vs VANILLA (paper claim)

Retrieval quality sets an upper bound on performance and current retrievers leave substantial room for gains.

Numbers: ASQA retrieval R@5 (GTR) = 56.8%; R@100 = 78.4%
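To make R@k concrete, here is a simplified sketch that measures, for one question, the share of gold answer strings found anywhere in the top-k retrieved passages. The case-insensitive substring match and the name `recall_at_k` are assumptions for illustration; the paper's exact retrieval-recall definition may differ.

```python
from typing import Sequence


def recall_at_k(
    ranked_passages: Sequence[str],
    gold_answers: Sequence[str],
    k: int,
) -> float:
    """Fraction of gold answers that appear in the top-k passages.

    Assumption: an answer counts as retrieved if it occurs as a
    case-insensitive substring of the concatenated top-k passages.
    """
    if not gold_answers:
        return 0.0
    top_k_text = " ".join(ranked_passages[:k]).lower()
    hits = sum(1 for answer in gold_answers if answer.lower() in top_k_text)
    return hits / len(gold_answers)
```

Averaging this value over all questions gives a corpus-level number that can be compared against downstream correctness, since an answer the retriever never surfaces cannot be cited.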

Results

ASQA ChatGPT VANILLA (5 passages) - Fluency (MAUVE)

Value: 66.6

ASQA ChatGPT VANILLA (5 passages) - Correctness (EM recall)

Value: 40.4

ASQA ChatGPT VANILLA (5 passages) - Citation Recall

Value: 73.6

QAMPARI ChatGPT VANILLA (5 passages) - Correctness (Rec.-5)

Value: 20.8

ELI5 ChatGPT VANILLA (5 passages) - Correctness (Claim recall)

Value: 12.0

ELI5 ChatGPT VANILLA (5 passages) - Citation Recall

Value: 51.1

Who Should Care

What To Try In 7 Days

Run VANILLA: feed top-k retrieved passages to your LLM and measure citation recall with an NLI model.

Generate 4 answers and apply RERANK by citation recall to improve verifiable output.

Swap retriever to a stronger dense retriever (e.g., GTR) and compare retrieval R@k vs downstream correctness.
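The first step above, placing the top-k retrieved passages in the prompt, might look like the sketch below. The instruction wording and the function name `build_vanilla_prompt` are illustrative assumptions, not the prompt used in the paper.

```python
from typing import Sequence


def build_vanilla_prompt(question: str, passages: Sequence[str], k: int = 5) -> str:
    """Build a prompt that lists the top-k passages as numbered documents.

    Assumption: the model is asked to cite documents inline as [1], [2], ...
    so that each statement can later be checked against its cited passages.
    """
    lines = [
        "Answer the question using only the documents below.",
        "Cite the supporting documents inline, for example [1] or [2][3].",
        "",
    ]
    for index, passage in enumerate(passages[:k], start=1):
        lines.append(f"Document [{index}]: {passage}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

Numbering the documents from 1 keeps the citation markers in the answer aligned with the passage order, which simplifies the later citation check.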

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MAUVE can be unstable and sensitive to output length.
  • ELI5 claim-generation may not capture all valid answer aspects for open questions.
  • Citation checks rely on an NLI model that cannot detect 'partial support' reliably.
  • Datasets do not cover multi-hop math or code-heavy tasks.
  • Paper focuses on prompting; training models to natively generate citations is left to future work.

When Not To Use

  • When you require page-level or web-page citations instead of 100-word passages.
  • When your task requires multi-hop reasoning or precise math verification not represented in ALCE.
  • If you need guaranteed detection of partial support by citations (NLI has blind spots).

Failure Modes

  • Retriever fails to return supporting passages, limiting downstream citation recall.
  • LLM gets distracted by irrelevant passages and omits key facts.
  • Post-hoc citation matching attaches citations that don't truly support generated text.
  • NLI classifier mislabels partial support as irrelevant, inflating false negatives.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • ChatGPT-16K
  • GPT-4
  • LLaMA
  • Vicuna
  • Alpaca
  • LLaMA-2-Chat
  • Fusion-in-Decoder (FiD)

Metrics

  • MAUVE (fluency)
  • Exact-Match Recall (ASQA)
  • Precision/Recall and Recall-5 (QAMPARI)
  • Claim Recall (ELI5 via generated sub-claims)
  • Citation Recall (NLI-based)
  • Citation Precision (NLI-based)
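As a rough illustration of NLI-based citation recall, the sketch below counts a statement as supported only if it carries at least one citation and the concatenation of its cited passages entails it. The `entails` callable stands in for an NLI model and is an assumption, as are the `[n]` citation syntax and the function name; citation precision is not covered here.

```python
import re
from typing import Callable, Dict, Sequence

# Matches inline markers such as "[3]", including any whitespace before them.
CITATION = re.compile(r"\s*\[(\d+)\]")


def citation_recall(
    statements: Sequence[str],
    passages: Dict[int, str],
    entails: Callable[[str, str], bool],
) -> float:
    """Share of statements whose cited passages jointly entail them.

    `entails(premise, hypothesis)` is an assumed wrapper around an NLI model.
    Statements without any valid citation count as unsupported.
    """
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        cited_ids = [int(n) for n in CITATION.findall(statement)]
        cited = [passages[i] for i in cited_ids if i in passages]
        claim = CITATION.sub("", statement).strip()
        if cited and entails(" ".join(cited), claim):
            supported += 1
    return supported / len(statements)
```

Because the verdict is binary per statement, a passage that supports only part of a statement counts as no support, which matches the partial-support blind spot noted under Limitations.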

Datasets

  • ASQA
  • QAMPARI
  • ELI5
  • Wikipedia (2018-12-20 snapshot)
  • Sphere (filtered Common Crawl)

Benchmarks

  • ALCE