ALCE: a reproducible benchmark and metrics to make LLM answers cite their sources

Overview

Decision Snapshot: Needs Validation

The benchmark and metrics are validated by human evaluation and broad experiments. Results are robust but limited by NLI accuracy and dataset coverage; expect improvements when retriever quality or long-context models improve.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build customer-facing assistants, ALCE gives a reproducible way to measure whether answers are supported by sources and helps reduce user mistrust from hallucinations.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The paper introduces ALCE, a reproducible benchmark for evaluating LLMs that must generate long answers with explicit citations. ALCE includes three datasets (ASQA, QAMPARI, ELI5), a retrieval setup over 100-word passages, and automatic metrics for fluency (MAUVE), correctness (dataset-specific measures), and citation quality (citation recall and precision, judged by the NLI model TRUE). The authors test many prompting strategies and models (ChatGPT, GPT-4, LLaMA variants) and show that simple "VANILLA" prompting with the top-k retrieved passages in context is a strong baseline. Major gaps remain in retrieval quality, context-window limits, and multi-document synthesis. Code and data are public.

Problem Statement

LLMs produce fluent answers but often hallucinate and lack verifiable sources. Prior systems use closed commercial search and human-only evaluation, which makes comparisons and reproduction hard. The paper builds a reproducible benchmark and automatic metrics to measure whether LLMs' generated statements are supported by cited passages.

Main Contribution

ALCE benchmark: three datasets (ASQA, QAMPARI, ELI5) with retrieval corpora and a 100-word passage format.

Automatic evaluation suite measuring fluency, correctness, and citation quality, validated by human judgments.
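
The 100-word passage format can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and non-overlapping windows; the function name `split_into_passages` is hypothetical, and the benchmark's own preprocessing may differ.

```python
def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split text into consecutive passages of at most `words_per_passage` words.

    Assumption: words are whitespace-separated and passages do not overlap.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# Usage: a synthetic 250-word document yields passages of 100, 100, and 50 words.
document = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(document)
passage_lengths = [len(p.split()) for p in passages]
```

Fixed-size passages keep the citation unit small enough that a reader, or an NLI model, can check a claim against it quickly, which is also why page-level citation needs fall outside this setup.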

Key Findings

Many best-performing models still fail to fully support their answers with cited passages on open-ended questions.

Numbers: ≈50% of generations lack full citation support on ELI5 (ChatGPT/GPT-4)

Practical Use: Don't assume an LLM's citations prove correctness; verify citations or improve retriever/synthesis before deploying for verifiable answers.

Evidence Ref: Abstract; Table 6

A simple prompt that places top-k retrieved passages in context (VANILLA) provides a strong baseline.

Numbers: ChatGPT VANILLA (5-psg) citation recall 73.6% on ASQA

Practical Use: Start system development with VANILLA (top-k in context) as a quick, competitive baseline before complex retrieval flows.

Evidence Ref: Table 4
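
As a rough illustration of what "top-k passages in context" means, the sketch below assembles such a prompt. The instruction wording, the `Document [i]` layout, and the function name `build_vanilla_prompt` are assumptions made for this example, not the paper's exact prompt.

```python
def build_vanilla_prompt(question: str, passages: list[str], k: int = 5) -> str:
    """Place the top-k retrieved passages in the prompt, numbered for citation.

    Assumption: `passages` is already sorted by retrieval score, best first.
    """
    # Hypothetical instruction text; the benchmark's prompt wording may differ.
    instruction = (
        "Answer the question using only the documents below. "
        "Cite the documents that support each sentence with bracketed "
        "numbers such as [1] or [2][3]."
    )
    lines = [instruction, "", f"Question: {question}", ""]
    for idx, passage in enumerate(passages[:k], start=1):
        lines.append(f"Document [{idx}]: {passage}")
    lines += ["", "Answer:"]
    return "\n".join(lines)


# Usage: only the first k passages are included.
prompt = build_vanilla_prompt(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is in Germany.", "Unused passage."],
    k=2,
)
```

The resulting string can be sent to whichever LLM client you already use; numbering the passages is what lets the citations in the answer be mapped back to sources.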

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASQA ChatGPT VANILLA (5 passages) - Fluency (MAUVE)	66.6	—	—	ASQA	Table 4 (MAUVE)	Table 4
ASQA ChatGPT VANILLA (5 passages) - Correctness (EM recall)	40.4	—	—	ASQA	Table 4 (EM recall)	Table 4
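
For readers unfamiliar with the EM recall column, here is a simplified sketch of an exact-match recall computation: the share of gold short answers that appear as a normalized substring of the generated answer. The normalization steps and the names `normalize` and `em_recall` are assumptions for illustration; the benchmark's released scoring code is the reference.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_recall(answer: str, gold_answer_sets: list[list[str]]) -> float:
    """Fraction of gold answers (each with aliases) found in the generated answer."""
    if not gold_answer_sets:
        return 0.0
    norm_answer = normalize(answer)
    hits = 0
    for aliases in gold_answer_sets:
        normalized = [normalize(alias) for alias in aliases]
        # A gold answer counts as recalled if any non-empty alias is a substring.
        if any(alias and alias in norm_answer for alias in normalized):
            hits += 1
    return hits / len(gold_answer_sets)


# Usage: two of the three gold answers appear in the generation.
score = em_recall(
    "The film premiered in 1994 in the U.S. and 1995 in Japan.",
    [["1994"], ["1995", "nineteen ninety-five"], ["1996"]],
)
```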

What To Try In 7 Days

Run VANILLA: feed top-k retrieved passages to your LLM and measure citation recall with an NLI model.

Generate 4 answers and apply RERANK by citation recall to improve verifiable output.

Swap retriever to a stronger dense retriever (e.g., GTR) and compare retrieval R@k vs downstream correctness.
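
The first two steps can be sketched as follows. This is a simplified approximation under stated assumptions: statements are split on sentence punctuation, citations look like `[1]`, and the entailment check is passed in as a function so any NLI model can be plugged in. The names `citation_recall`, `rerank`, and `toy_entails` are hypothetical, and `toy_entails` is a substring stand-in for demonstration, not an NLI model.

```python
import re
from typing import Callable

# (premise, claim) -> True if the premise entails the claim.
Entails = Callable[[str, str], bool]

CITATION = re.compile(r"\s*\[(\d+)\]")


def split_statements(answer: str) -> list[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def citation_recall(answer: str, passages: list[str], entails: Entails) -> float:
    """Share of statements entailed by the concatenation of their cited passages."""
    statements = split_statements(answer)
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        ids = [int(n) for n in CITATION.findall(statement)]
        cited = [passages[i - 1] for i in ids if 1 <= i <= len(passages)]
        claim = CITATION.sub("", statement).strip()
        # A statement without any valid citation counts as unsupported.
        if cited and entails(" ".join(cited), claim):
            supported += 1
    return supported / len(statements)


def rerank(candidates: list[str], passages: list[str], entails: Entails) -> str:
    """Return the candidate answer with the highest citation recall."""
    return max(candidates, key=lambda a: citation_recall(a, passages, entails))


def toy_entails(premise: str, claim: str) -> bool:
    """Stand-in for an NLI model: case-insensitive substring match."""
    return claim.rstrip(".").lower() in premise.lower()


passages = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
wrong_cite = "Paris is the capital of France [1]. Berlin is the capital of Germany [1]."
right_cite = "Paris is the capital of France [1]. Berlin is the capital of Germany [2]."
best = rerank([wrong_cite, right_cite], passages, toy_entails)
```

In practice you would sample several answers (the list above suggests four) and replace `toy_entails` with a call to a real entailment model; keeping it as a parameter makes that swap a one-line change.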

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/princeton-nlp/ALCE

Data URLs

https://github.com/princeton-nlp/ALCE

https://arxiv.org/abs/2305.14627

Risks & Boundaries

Limitations

MAUVE can be unstable and sensitive to output length.

ELI5 claim-generation may not capture all valid answer aspects for open questions.

When Not To Use

When you require page-level or web-page citations instead of 100-word passages.

When your task requires multi-hop reasoning or precise math verification not represented in ALCE.

Failure Modes

Retriever fails to return supporting passages, limiting downstream citation recall.

LLM gets distracted by irrelevant passages and omits key facts.
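
One way to tell the first failure mode apart from the second is to check whether the gold answers appear anywhere in the retrieved passages before blaming the generator. The sketch below computes a simple retrieval recall@k by substring matching; the matching rule and the name `retrieval_recall_at_k` are assumptions for illustration.

```python
def retrieval_recall_at_k(
    ranked_passages: list[str], gold_answers: list[str], k: int
) -> float:
    """Fraction of gold answers that occur in the top-k retrieved passages.

    Assumption: a case-insensitive substring match is a good enough proxy
    for "the passage contains the answer".
    """
    if not gold_answers:
        return 0.0
    top_k_text = " ".join(ranked_passages[:k]).lower()
    found = sum(answer.lower() in top_k_text for answer in gold_answers)
    return found / len(gold_answers)


# Usage: the second gold answer only shows up once k reaches 3.
ranked = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The prize is awarded in Stockholm.",
    "She also won the Nobel Prize in Chemistry in 1911.",
]
gold = ["1903", "1911"]
recall_at_1 = retrieval_recall_at_k(ranked, gold, k=1)
recall_at_3 = retrieval_recall_at_k(ranked, gold, k=3)
```

If recall@k is low, citation recall is capped by retrieval no matter how the prompt is written; if recall@k is high but answers still miss facts, the generator is the more likely culprit.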

Core Entities

Models

ChatGPT (gpt-3.5-turbo), ChatGPT-16K, GPT-4, LLaMA, Vicuna, Alpaca, LLaMA-2-Chat, Fusion-in-Decoder (FiD)

Metrics

MAUVE (fluency), Exact-Match Recall (ASQA), Precision/Recall and Recall-5 (QAMPARI), Claim Recall (ELI5 via generated sub-claims), Citation Recall (NLI-based), Citation Precision (NLI-based)

Datasets

ASQA, QAMPARI, ELI5, Wikipedia (2018-12-20 snapshot), Sphere (filtered Common Crawl)

Benchmarks

ALCE

