Use automated preference learning to make LLM answers cite sources more reliably

Overview

Decision Snapshot: Needs Validation

The method shows reproducible gains on three public QA sets and uses standard tools (LoRA, gtr-t5-large). Results are promising but limited by dataset scope, reliance on retrieval quality, and synthesized pairs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Dongfang Li, Zetian Sun, Baotian Hu, Zhenyu Liu, Xinshuo Hu, Xuebo Liu, Min Zhang

Links

Abstract / PDF

Why It Matters For Business

APO reduces unsupported claims and improves citation accuracy by fine-tuning models with automated preference pairs, letting product teams boost trustworthiness without huge human-label budgets.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

This paper treats citation-aware generation as a preference-learning problem and introduces APO (Automatic Preference Optimization). The authors (1) post-train an LLM on 6,330 curated attribution examples, (2) auto-synthesize 95,263 preference pairs covering several error types, and (3) apply a progressive, statement-level preference optimization (with experience replay) using LoRA. On ASQA, StrategyQA and ELI5, APO raises citation F1 and often improves answer correctness versus retrieval-only baselines like Self-RAG and AGREE.

Problem Statement

LLMs often produce believable but unsupported claims and incorrect citations. Existing fixes focus on retrieval or post-hoc linking, not on teaching the model to prefer well-attributed generations. Manual pairwise labels for preference learning are costly. The paper asks: can we automatically create preference data and directly fine-tune models so they generate correct answers with faithful citations?

Main Contribution

Formulate attributed text generation as a preference learning problem and introduce the APO framework.

Build a post-training attribution corpus of 6,330 curated examples from EVIGSE, ExpertQA, and HAGRID.

Key Findings

APO improves ASQA citation F1 from 63.5 to 71.2 after preference optimization.

Numbers: ASQA citation F1: 63.5 -> 71.2

Practical Use: Apply preference optimization on top of post-training to boost citation accuracy by ~7–8 F1 points on ASQA-like tasks.

Evidence Ref: Table 2

APO raises ASQA exact-match recall on short answers (EM-R) to 40.5, which is 8.8 points above Self-RAG.

Numbers: ASQA EM-R: APO 40.5 vs Self-RAG 31.7 (+8.8)

Practical Use: APO can yield materially better factual answers on long-form QA benchmarks than retrieval-only baselines; consider it when correctness matters.

Evidence Ref: Table 2, Sec. 6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASQA EM-R (correct short answer recall)	40.5	Self-RAG 31.7	+8.8	ASQA test	APO (our method) EM-R 40.5 vs Self-RAG 31.7	Table 2
ASQA Citation F1	71.2	Post-training 63.5	+7.7	ASQA test	APO citation F1 71.2; post-training baseline 63.5	Table 2

What To Try In 7 Days

Fine-tune a current LLM on a small curated attribution set (~6k examples) to teach citation format.

Auto-synthesize negative preference examples (fabrication/omission) around your retrievals and create pairwise data.

Apply a direct PO method (e.g., IPO or progressive statement-level PO) with LoRA to limit compute and test citation F1 on a held-out QA set.
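The pair-synthesis step above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `Statement` and `PreferencePair` types, the `make_pair` helper, and the exact way each error is injected are all assumptions made for the example. Only the two error names (fabrication, omission) come from the text.

```python
import random
from dataclasses import dataclass


@dataclass
class Statement:
    text: str
    citations: list  # 1-based indices of retrieved passages that support the text


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # well-attributed statement
    rejected: str    # same statement with a corrupted citation set
    error_type: str


def render(stmt):
    """Format a statement with bracketed citation markers, e.g. 'Claim. [2]'."""
    marks = "".join(f"[{i}]" for i in stmt.citations)
    return f"{stmt.text} {marks}".strip()


def make_pair(question, stmt, num_passages, error_type, rng):
    """Build one pair by corrupting the citations of a correct statement.

    Hypothetical helper: the corruption rules below are illustrative only.
    """
    if error_type == "omission":
        # Drop the last supporting citation.
        bad = Statement(stmt.text, stmt.citations[:-1])
    elif error_type == "fabrication":
        # Cite a retrieved passage that was not marked as supporting.
        unused = [i for i in range(1, num_passages + 1) if i not in stmt.citations]
        if not unused:
            raise ValueError("no unsupported passage available to cite")
        bad = Statement(stmt.text, [rng.choice(unused)])
    else:
        raise ValueError(f"unknown error type: {error_type}")
    return PreferencePair(question, render(stmt), render(bad), error_type)


rng = random.Random(0)
good = Statement("The Eiffel Tower opened in 1889.", [2])
pair = make_pair("When did the Eiffel Tower open?", good, 3, "fabrication", rng)
print(pair.chosen, "|", pair.rejected)
```

In practice the rejected side would be checked (for example with an entailment model) to confirm it really is worse than the chosen side; that check is left out here.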

Optimization Features

Infra Optimization

Experiments run on NVIDIA A100 80G GPUs

Model Optimization

LoRA
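To give a sense of why LoRA limits compute, the sketch below counts trainable parameters for one weight matrix. LoRA freezes the original matrix and trains two low-rank factors instead. The 5120 dimension matches the hidden size of a 13B Llama-2 model; the rank of 8 is an assumption for illustration, since the summary does not state the rank used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries. LoRA trains two
    factors of shape (d_out, rank) and (rank, d_in) instead.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


full, lora = lora_param_counts(5120, 5120, 8)  # rank 8 is an assumed value
print(full, lora, lora / full)  # 26214400 81920 0.003125
```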

System Optimization

Mix post-training autoregressive loss with PO at intervals to avoid degeneration
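One simple way to picture this interleaving is a step schedule that switches to the post-training (autoregressive) loss every few steps. The function name `training_schedule` and the interval value are assumptions; the summary does not say how often the two losses alternate.

```python
def training_schedule(total_steps, sft_every):
    """Yield 'sft' on every sft_every-th step and 'po' on all other steps.

    'sft' stands for a step on the post-training autoregressive loss,
    'po' for a preference-optimization step.
    """
    if sft_every < 2:
        raise ValueError("sft_every must be at least 2")
    for step in range(1, total_steps + 1):
        yield "sft" if step % sft_every == 0 else "po"


print(list(training_schedule(6, 3)))  # ['po', 'po', 'sft', 'po', 'po', 'sft']
```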

Training Optimization

Progressive statement-level preference optimization

Experience replay to reduce overfitting
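As a rough illustration of a statement-level preference objective, the sketch below applies the standard IPO squared-margin loss to each statement pair and averages the results. This is an assumption about how such a loss could look, not the paper's exact objective; `tau` and the function names are chosen for the example, and the inputs are summed token log-probabilities under the policy and a frozen reference model.

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """IPO loss for one pair: squared distance of the log-ratio margin from 1/(2*tau)."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return (margin - 1.0 / (2.0 * tau)) ** 2


def statement_level_loss(statement_pairs, tau=0.1):
    """Average the IPO loss over the statement pairs of one response.

    Each item is (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    losses = [ipo_loss(*pair, tau=tau) for pair in statement_pairs]
    return sum(losses) / len(losses)


pairs = [
    (-1.0, -6.0, -2.0, -2.0),  # margin already at the target value
    (0.0, 0.0, 0.0, 0.0),      # policy identical to reference
]
print(statement_level_loss(pairs))
```

The loss is zero when the policy prefers the chosen statement by exactly the target margin, so it discourages both under-fitting and pushing the margin without bound.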

Inference Optimization

Low decoding temperature (0.01) for deterministic outputs
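The effect of such a low temperature can be seen with a plain softmax: dividing logits by 0.01 concentrates almost all probability on the top token, so sampling behaves close to greedy decoding. The logits below are made-up values for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.9, 0.5]  # illustrative next-token logits
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.01))
```

At temperature 1.0 the top token gets less than half of the probability mass; at 0.01 it gets more than 99.9%.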

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Data scope is narrow: post-training and PO syntheses come mainly from specific attribution datasets and Wikipedia.

Synthesized preference data might not cover all real-world hallucination modes or domains.

When Not To Use

If you lack a reliable retriever or your document sources are low quality or proprietary.

When domain coverage differs strongly from Wikipedia-style sources used in training.

Failure Modes

Model fabricates facts and attributes them to irrelevant documents when retrievals are poor.

Overfitting to deterministic synthesized preferences if experience replay is not used.

Core Entities

Models

llama-2-13b-base, llama-2-13b-chat, selfrag_llama2_7b (critic), gtr-t5-large (retriever)

Metrics

Citation Precision, Citation Recall, Citation F1, EM-R (exact-match recall on short answers), Claim recall, Accuracy
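Citation F1 combines citation precision and recall as their harmonic mean. The sketch below shows only that combination step; how precision and recall themselves are scored is defined by the paper and is not reproduced here. The input values are illustrative, not results from the paper.

```python
def citation_f1(precision, recall):
    """Harmonic mean of citation precision and citation recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(citation_f1(0.8, 0.6))  # illustrative inputs
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 by citing everything (high recall, low precision) or by citing almost nothing.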

Datasets

ASQA, StrategyQA, ELI5, EVIGSE, ExpertQA, HAGRID

Benchmarks

ASQA, StrategyQA, ELI5
