Use automated preference learning to make LLM answers cite sources more reliably

March 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows reproducible gains on three public QA sets and uses standard tools (LoRA, gtr-t5-large). Results are promising but limited by dataset scope, reliance on retrieval quality, and synthesized pairs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Dongfang Li, Zetian Sun, Baotian Hu, Zhenyu Liu, Xinshuo Hu, Xuebo Liu, Min Zhang


Why It Matters For Business

APO reduces unsupported claims and improves citation accuracy by fine-tuning models with automated preference pairs, letting product teams boost trustworthiness without huge human-label budgets.

Summary TLDR

This paper treats citation-aware generation as a preference-learning problem and introduces APO (Automatic Preference Optimization). The authors (1) post-train an LLM on 6,330 curated attribution examples, (2) auto-synthesize 95,263 preference pairs covering several error types, and (3) apply a progressive, statement-level preference optimization (with experience replay) using LoRA. On ASQA, StrategyQA and ELI5, APO raises citation F1 and often improves answer correctness versus retrieval-only baselines like Self-RAG and AGREE.

Problem Statement

LLMs often produce believable but unsupported claims and incorrect citations. Existing fixes focus on retrieval or post-hoc linking, not on teaching the model to prefer well-attributed generations. Manual pairwise labels for preference learning are costly. The paper asks: can we automatically create preference data and directly fine-tune models so they generate correct answers with faithful citations?

Main Contribution

Formulate attributed text generation as a preference learning problem and introduce the APO framework.

Build a post-training attribution corpus of 6,330 curated examples from EVIGSE, ExpertQA, and HAGRID.
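For readers new to preference learning, it may help to see what a pairwise objective looks like. The summary does not give APO's exact loss, so the formula below is the standard Direct Preference Optimization (DPO) objective, shown only as an illustrative assumption about the general shape of such a loss. The symbols are not from the source: $x$ is the question plus retrieved passages, $y_w$ the preferred (well-attributed) output, $y_l$ the dispreferred one, $\pi_{\text{ref}}$ a frozen reference model (assumed here to be the post-trained model), and $\beta$ a scaling hyperparameter.

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss pushes the model to assign relatively more probability to the well-attributed output than to the corrupted one, without training a separate reward model.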

Key Findings

APO improves ASQA citation F1 from 63.5 to 71.2 after preference optimization.

Numbers: ASQA citation F1: 63.5 -> 71.2

Practical Use: Apply preference optimization on top of post-training to boost citation accuracy by ~7–8 F1 points on ASQA-like tasks.

Evidence Ref: Table 2

APO raises ASQA exact-match recall on short answers (EM-R) to 40.5, outperforming Self-RAG by 8.8 points.

Numbers: ASQA EM-R: APO 40.5 vs Self-RAG 31.7 (+8.8)

Practical Use: APO can yield materially more accurate answers on long-form QA benchmarks than retrieval-only baselines; consider it when answer correctness matters.

Evidence Ref: Table 2, Sec. 6.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ASQA EM-R (correct short answer recall) | 40.5 | Self-RAG 31.7 | +8.8 | ASQA test | APO (our method) EM-R 40.5 vs Self-RAG 31.7 | Table 2
ASQA Citation F1 | 71.2 | Post-training 63.5 | +7.7 | ASQA test | APO citation F1 71.2; post-training baseline 63.5 | Table 2

What To Try In 7 Days

Fine-tune a current LLM on a small curated attribution set (~6k examples) to teach citation format.

Auto-synthesize negative preference examples (fabrication/omission) around your retrievals and create pairwise data.

Apply a direct preference optimization (PO) method (e.g., IPO or progressive statement-level PO) with LoRA to limit compute, then measure citation F1 on a held-out QA set.
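The second step above (auto-synthesizing negatives) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's pipeline: citations are assumed to appear as `[n]` markers, and the names `PreferencePair`, `drop_citations`, `swap_citation`, and `make_pairs` are hypothetical. Only two cheap corruptions are shown (citation omission and repointing a citation to a different document); fabricating unsupported content would need a generator model and is left out.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # original, well-attributed statement
    rejected: str    # corrupted variant of the same statement
    error_type: str  # which corruption produced the rejected text


CITATION = re.compile(r"\[(\d+)\]")


def drop_citations(statement: str) -> str:
    """Omission error: remove every [n] marker from the statement."""
    return re.sub(r"\s*\[\d+\]", "", statement)


def swap_citation(statement: str, num_docs: int, rng: random.Random) -> str:
    """Misattribution error: repoint the first [n] marker at another document."""
    match = CITATION.search(statement)
    if match is None or num_docs < 2:
        return statement  # nothing to corrupt
    current = int(match.group(1))
    others = [i for i in range(1, num_docs + 1) if i != current]
    replacement = rng.choice(others)
    return statement[:match.start()] + f"[{replacement}]" + statement[match.end():]


def make_pairs(prompt: str, statement: str, num_docs: int, seed: int = 0):
    """Build one pair per corruption that actually changed the statement."""
    rng = random.Random(seed)
    candidates = {
        "omission": drop_citations(statement),
        "misattribution": swap_citation(statement, num_docs, rng),
    }
    return [
        PreferencePair(prompt, statement, rejected, kind)
        for kind, rejected in candidates.items()
        if rejected != statement
    ]
```

Calling `make_pairs("Q?", "Paris is the capital of France [1].", num_docs=3)` yields two pairs whose `chosen` field is the original statement. A statement with no citation markers yields no pairs, which keeps degenerate examples out of the training set.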

Optimization Features

Infra Optimization: Experiments run on NVIDIA A100 80GB GPUs

Model Optimization: LoRA

System Optimization: Mix post-training autoregressive loss with PO at intervals to avoid degeneration

Training Optimization: Progressive statement-level preference optimization; experience replay to reduce overfitting

Inference Optimization: Low decoding temperature (0.01) for deterministic outputs
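To make the training-side items concrete, here is a rough Python sketch of how a schedule might combine staged preference optimization, experience replay, and periodic autoregressive steps. Everything in it is an assumption for illustration: the source does not describe how stages are defined, how much data is replayed, or how often the autoregressive loss is mixed in, and `build_schedule`, `replay_ratio`, and `sft_every` are hypothetical names.

```python
import random


def build_schedule(stages, replay_ratio=0.2, sft_every=4, seed=0):
    """Return an ordered list of (kind, item) training steps.

    stages: list of lists of preference pairs, in the order they are trained.
    replay_ratio: fraction of each later stage's size re-drawn from earlier stages.
    sft_every: insert one autoregressive ("sft") step after this many PO steps.
    """
    rng = random.Random(seed)
    seen = []        # pairs from completed stages, available for replay
    schedule = []
    po_steps = 0
    for stage in stages:
        batch = list(stage)
        if seen:
            # Experience replay: mix in a sample of earlier-stage pairs.
            k = min(len(seen), int(len(stage) * replay_ratio))
            batch += rng.sample(seen, k)
        rng.shuffle(batch)
        for pair in batch:
            schedule.append(("po", pair))
            po_steps += 1
            if po_steps % sft_every == 0:
                # Periodic autoregressive step to guard against degeneration.
                schedule.append(("sft", None))
        seen.extend(stage)
    return schedule
```

With two stages of four pairs each, `replay_ratio=0.5`, and `sft_every=4`, the schedule contains ten preference steps (two of them replayed from the first stage) and two autoregressive steps. A real training loop would dispatch each `"po"` step to the preference loss and each `"sft"` step to the standard next-token loss on the post-training data.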

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data scope is narrow: the post-training set and the synthesized preference pairs come mainly from specific attribution datasets and Wikipedia.

Synthesized preference data might not cover all real-world hallucination modes or domains.

When Not To Use

If you lack a reliable retriever or your document sources are low quality or proprietary.

When domain coverage differs strongly from Wikipedia-style sources used in training.

Failure Modes

Model fabricates facts and attributes them to irrelevant documents when retrievals are poor.

Overfitting to deterministic synthesized preferences if experience replay is not used.

Core Entities

Models

llama-2-13b-base, llama-2-13b-chat, selfrag_llama2_7b (critic), gtr-t5-large (retriever)

Metrics

Citation Precision, Citation Recall, Citation F1, EM-R (exact-match recall on short answers), Claim recall, Accuracy
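As a rough guide to how the citation metrics relate, the sketch below computes precision, recall, and F1 from per-item judgments. It assumes a common convention that the source does not spell out: recall is the share of statements fully supported by their cited passages, precision is the share of citations judged relevant, and F1 is their harmonic mean. The judgments themselves (typically produced by an entailment model) are taken as given, and `citation_scores` is a hypothetical helper name.

```python
def citation_scores(statement_supported, citation_relevant):
    """Compute (precision, recall, f1) from boolean judgments.

    statement_supported: one bool per generated statement, True if its
        citations fully support it.
    citation_relevant: one bool per emitted citation, True if the cited
        passage is relevant to the statement it is attached to.
    """
    recall = (
        sum(statement_supported) / len(statement_supported)
        if statement_supported else 0.0
    )
    precision = (
        sum(citation_relevant) / len(citation_relevant)
        if citation_relevant else 0.0
    )
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, an answer whose two statements are both supported but where only one of two citations is relevant has recall 1.0 and precision 0.5, so F1 sits between them and closer to the lower value.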

Datasets

ASQA, StrategyQA, ELI5, EVIGSE, ExpertQA, HAGRID

Benchmarks

ASQA, StrategyQA, ELI5