ALCE: a reproducible benchmark and metrics to make LLM answers cite their sources

May 24, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark and metrics are validated by human evaluation and broad experiments. Results are robust but limited by NLI accuracy and dataset coverage; expect improvements when retriever quality or long-context models improve.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build customer-facing assistants, ALCE gives a reproducible way to measure whether answers are supported by sources and helps reduce user mistrust from hallucinations.

Who Should Care

Summary TLDR

The paper introduces ALCE, a reproducible benchmark for evaluating LLMs that must generate long answers with explicit citations. ALCE includes three datasets (ASQA, QAMPARI, ELI5), a retrieval setup over 100-word passages, and automatic metrics for fluency (MAUVE), correctness (dataset-specific measures), and citation quality (judged by an NLI model called TRUE). The authors test many prompting strategies and models (ChatGPT, GPT-4, LLaMA variants) and show that simple "VANILLA" prompting with the top-k retrieved passages in context is a strong baseline. Major gaps remain in retrieval quality, context-window limits, and multi-document synthesis. Code and data are public.

Problem Statement

LLMs produce fluent answers but often hallucinate and lack verifiable sources. Prior systems use closed commercial search and human-only evaluation, which makes comparisons and reproduction hard. The paper builds a reproducible benchmark and automatic metrics to measure whether LLMs' generated statements are supported by cited passages.

Main Contribution

ALCE benchmark: three datasets (ASQA, QAMPARI, ELI5) with retrieval corpora and a 100-word passage format.

Automatic evaluation suite measuring fluency, correctness, and citation quality, validated by human judgments.
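The 100-word passage format mentioned above can be illustrated with a minimal sketch. The source only states the passage length; splitting on whitespace and keeping a shorter final passage are assumptions made here, and `split_into_passages` is a hypothetical helper name, not part of the ALCE code.

```python
def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split a document into consecutive fixed-size word windows.

    Assumption: words are whitespace-delimited and the last passage may be
    shorter than `words_per_passage`.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# Toy document of 250 placeholder words.
document = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(document)
```

Short passages keep each citation narrow, which is also why the format may not fit tasks that need page-level citations.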

Key Findings

Many best-performing models still fail to fully support their answers with cited passages on open-ended questions.

Numbers: ≈50% of generations lack full citation support on ELI5 (ChatGPT/GPT-4)

Practical Use: Don't assume an LLM's citations prove correctness; verify citations or improve retriever/synthesis before deploying for verifiable answers.

Evidence Ref: Abstract; Table 6

A simple prompt that places top-k retrieved passages in context (VANILLA) provides a strong baseline.

Numbers: ChatGPT VANILLA (5-psg) citation recall 73.6% on ASQA

Practical Use: Start system development with VANILLA (top-k in context) as a quick, competitive baseline before complex retrieval flows.

Evidence Ref: Table 4
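As a rough illustration of the VANILLA idea (top-k retrieved passages placed directly in the prompt), the sketch below assembles such a prompt. The instruction wording, the `Document [i]` layout, and the function name `build_vanilla_prompt` are assumptions for illustration; they are not the paper's actual prompt.

```python
def build_vanilla_prompt(question: str, passages: list[str], k: int = 5) -> str:
    """Place the top-k passages in context, numbered so the model can cite them."""
    # Hypothetical instruction text; the paper's real prompt may differ.
    instruction = (
        "Answer the question using only the documents below. "
        "Cite supporting documents inline as [1], [2], and so on."
    )
    lines = [instruction, ""]
    for i, passage in enumerate(passages[:k], start=1):
        lines.append(f"Document [{i}]: {passage}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Seven retrieved passages, but only the top five go into the prompt.
retrieved = [f"passage {i}" for i in range(1, 8)]
prompt = build_vanilla_prompt("Who wrote Hamlet?", retrieved, k=5)
```

Numbering the passages in the prompt is what makes the generated `[i]` markers checkable afterwards.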

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ASQA ChatGPT VANILLA (5 passages) - Fluency (MAUVE) | 66.6 | - | - | ASQA | Table 4 (MAUVE) | Table 4
ASQA ChatGPT VANILLA (5 passages) - Correctness (EM recall) | 40.4 | - | - | ASQA | Table 4 (EM recall) | Table 4

What To Try In 7 Days

Run VANILLA: feed top-k retrieved passages to your LLM and measure citation recall with an NLI model.

Generate 4 answers and apply RERANK by citation recall to improve verifiable output.

Swap retriever to a stronger dense retriever (e.g., GTR) and compare retrieval R@k vs downstream correctness.
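The first two steps above can be sketched together: score each answer by citation recall, then keep the best of several candidates. This is a simplified reading, assuming a sentence counts as supported when it cites at least one passage and the cited passages together entail it. The `entails` callable stands in for a real NLI model; `toy_entails` is a word-overlap placeholder for demonstration only, and all function names are hypothetical.

```python
import re
from typing import Callable, Sequence

CITATION = re.compile(r"\[(\d+)\]")


def citation_recall(answer: str, passages: Sequence[str],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of sentences that are entailed by the passages they cite.

    Assumption: citation markers like [1] appear before the sentence-final
    punctuation, so a simple punctuation-based sentence split keeps them attached.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        ids = [int(n) for n in CITATION.findall(sentence)]
        cited = [passages[i - 1] for i in ids if 1 <= i <= len(passages)]
        claim = CITATION.sub("", sentence).strip()
        # A sentence without any valid citation is never counted as supported.
        if cited and entails(" ".join(cited), claim):
            supported += 1
    return supported / len(sentences)


def rerank_by_citation_recall(candidates: Sequence[str], passages: Sequence[str],
                              entails: Callable[[str, str], bool]) -> str:
    """Return the candidate answer with the highest citation recall."""
    return max(candidates, key=lambda a: citation_recall(a, passages, entails))


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model: every hypothesis word occurs in the premise."""
    return _words(hypothesis) <= _words(premise)


passages = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
candidates = [
    "Paris is the capital of France. Berlin is the capital of Germany.",
    "Paris is the capital of France [1]. Berlin is the capital of Germany [1].",
    "Paris is the capital of France [1]. Berlin is the capital of Germany [2].",
    "Paris is the capital of Germany [2].",
]
best = rerank_by_citation_recall(candidates, passages, toy_entails)
```

In practice the four candidates would come from sampling the LLM, and `toy_entails` would be replaced by a call to an actual NLI model.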

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

MAUVE can be unstable and sensitive to output length.

ELI5 claim-generation may not capture all valid answer aspects for open questions.

When Not To Use

When you require page-level or web-page citations instead of 100-word passages.

When your task requires multi-hop reasoning or precise math verification not represented in ALCE.

Failure Modes

Retriever fails to return supporting passages, limiting downstream citation recall.

LLM gets distracted by irrelevant passages and omits key facts.
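The retriever failure mode above can be checked before any generation by measuring retrieval recall@k. The sketch below uses one common definition, the share of questions with at least one gold passage in the top k; the source does not spell out its exact definition, and the passage IDs and the name `recall_at_k` are hypothetical.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[str]], gold: Sequence[set], k: int) -> float:
    """Share of queries whose top-k retrieved IDs include at least one gold ID."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ranked, relevant in zip(retrieved, gold)
        if relevant.intersection(ranked[:k])
    )
    return hits / len(retrieved)


# Two toy queries: ranked passage IDs from the retriever, and gold passage IDs.
retrieved_ids = [["p3", "p7", "p1"], ["p2", "p9", "p4"]]
gold_ids = [{"p1"}, {"p8"}]

r_at_1 = recall_at_k(retrieved_ids, gold_ids, k=1)
r_at_3 = recall_at_k(retrieved_ids, gold_ids, k=3)
```

If recall@k is low, downstream citation recall is capped regardless of how well the LLM synthesizes, so this number is worth tracking alongside correctness when swapping retrievers.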

Core Entities

Models

ChatGPT (gpt-3.5-turbo), ChatGPT-16K, GPT-4, LLaMA, Vicuna, Alpaca, LLaMA-2-Chat, Fusion-in-Decoder (FiD)

Metrics

MAUVE (fluency), Exact-Match Recall (ASQA), Precision/Recall and Recall-5 (QAMPARI), Claim Recall (ELI5 via generated sub-claims), Citation Recall (NLI-based), Citation Precision (NLI-based)
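To make one of these metrics concrete, here is a minimal sketch of exact-match recall: the fraction of gold short answers that appear as a substring of the generated answer. The normalization (lowercasing, stripping punctuation) and the alias-list input format are assumptions for illustration and may not match the benchmark's implementation.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_recall(generation: str, gold_answers: list[list[str]]) -> float:
    """Fraction of gold answers (each a list of aliases) found in the generation."""
    if not gold_answers:
        return 0.0
    haystack = normalize(generation)
    hits = sum(
        1 for aliases in gold_answers
        if any(normalize(alias) in haystack for alias in aliases)
    )
    return hits / len(gold_answers)


generation = "The Eiffel Tower opened in 1889 in Paris."
gold = [["1889"], ["Paris", "Paris, France"], ["Gustave Eiffel"]]
em_recall = exact_match_recall(generation, gold)
```

Substring matching rewards coverage of the expected short answers but says nothing about whether the citations support them, which is why correctness and citation quality are reported separately.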

Datasets

ASQA, QAMPARI, ELI5, Wikipedia (2018-12-20 snapshot), Sphere (filtered Common Crawl)

Benchmarks

ALCE