Use automated preference learning to make LLM answers cite sources more reliably

March 27, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Dongfang Li, Zetian Sun, Baotian Hu, Zhenyu Liu, Xinshuo Hu, Xuebo Liu, Min Zhang

Why It Matters For Business

APO fine-tunes models on automatically generated preference pairs, which reduces unsupported claims and improves citation accuracy. Product teams can improve trustworthiness without large human-labeling budgets.

Summary TLDR

This paper treats citation-aware generation as a preference-learning problem and introduces APO (Automatic Preference Optimization). The authors (1) post-train an LLM on 6,330 curated attribution examples, (2) automatically synthesize 95,263 preference pairs covering several error types, and (3) apply progressive, statement-level preference optimization with experience replay, trained using LoRA. On ASQA, StrategyQA, and ELI5, APO raises citation F1 and often improves answer correctness compared with baselines such as Self-RAG and AGREE.

Problem Statement

LLMs often produce believable but unsupported claims and incorrect citations. Existing fixes focus on retrieval or post-hoc linking, not on teaching the model to prefer well-attributed generations. Manual pairwise labels for preference learning are costly. The paper asks: can we automatically create preference data and directly fine-tune models so they generate correct answers with faithful citations?

Main Contribution

Formulate attributed text generation as a preference learning problem and introduce the APO framework.

Build a post-training attribution corpus of 6,330 curated examples from EVIGSE, ExpertQA, and HAGRID.

Propose an automated pipeline that synthesizes 95,263 preference pairs covering predefined error types (fabrication, mistaken synthesis, omission, irrelevant-but-supported).

Introduce progressive statement-level preference optimization with experience replay to strengthen fine-grained preferences without explicit reward models.

Show consistent gains on ASQA, StrategyQA and ELI5 in citation F1 and often in answer correctness compared to baselines.
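
The automated pair synthesis listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: it applies two simplified, citation-level corruptions (dropping a citation, and citing an unrelated document) to a correctly attributed statement, whereas the paper's error types may be built differently, for example by rewriting the statement text. The names `Statement`, `PreferencePair`, `omit_citation`, `miscite`, and `build_pairs` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    text: str
    citations: tuple[int, ...]  # indices into the retrieved documents


@dataclass(frozen=True)
class PreferencePair:
    chosen: Statement    # the well-attributed statement
    rejected: Statement  # a corrupted variant of it
    error_type: str


def omit_citation(stmt, rng):
    """Hypothetical 'omission' corruption: drop one supporting citation."""
    if not stmt.citations:
        return None
    drop = rng.randrange(len(stmt.citations))
    kept = tuple(c for i, c in enumerate(stmt.citations) if i != drop)
    return Statement(stmt.text, kept)


def miscite(stmt, num_docs, rng):
    """Hypothetical corruption: cite a document the statement did not use."""
    unrelated = [d for d in range(num_docs) if d not in stmt.citations]
    if not unrelated:
        return None
    return Statement(stmt.text, (rng.choice(unrelated),))


def build_pairs(stmt, num_docs, seed=0):
    """Pair the original statement (chosen) with each corrupted variant (rejected)."""
    rng = random.Random(seed)
    candidates = (
        ("omission", omit_citation(stmt, rng)),
        ("miscitation", miscite(stmt, num_docs, rng)),
    )
    return [
        PreferencePair(stmt, rejected, error_type)
        for error_type, rejected in candidates
        if rejected is not None
    ]
```

Because the rejected side is generated by rule from a known-good statement, no human pairwise labeling is needed, which is the cost argument the paper makes.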

Key Findings

APO improves ASQA citation F1 from 63.5 to 71.2 after preference optimization.

Numbers: ASQA citation F1: 63.5 -> 71.2

APO raises ASQA exact-match short answer (EM-R) to 40.5, outperforming Self-RAG by 8.8 points on EM-R.

Numbers: ASQA EM-R: APO 40.5 vs Self-RAG 31.7 (+8.8)

The synthesized preference data comprises 95,263 auto-generated pairs used for preference optimization (PO).

Numbers: 95,263 preference pairs

Error breakdown in human review: fabrication accounts for 48.4% of attribution errors.

Numbers: Fabrication 48.4% of errors

Results

ASQA EM-R (correct short answer recall)

Value: 40.5

Baseline: Self-RAG 31.7

ASQA Citation F1

Value: 71.2

Baseline: Post-training 63.5

Accuracy

Value: 61.8

Baseline: Self-RAG 62.1

ELI5 Claim Recall

Value: 13.5

Baseline: Self-RAG 10.7

Who Should Care

What To Try In 7 Days

Fine-tune a current LLM on a small curated attribution set (~6k examples) to teach citation format.

Auto-synthesize negative preference examples (fabrication/omission) around your retrievals and create pairwise data.

Apply a direct preference optimization (PO) method (e.g., IPO or progressive statement-level PO) with LoRA to limit compute, then measure citation F1 on a held-out QA set.
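
For the third step, a minimal sketch of the IPO objective as it is commonly stated may help. It works on plain floats rather than tensors, and the function names, the `tau` default, and the idea of summing token log-probabilities over one statement (instead of the whole answer) are assumptions for illustration, not the paper's confirmed formulation.

```python
def sequence_logp(token_logps):
    """Sum per-token log-probabilities for one statement (assumed granularity)."""
    return sum(token_logps)


def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """Squared-error IPO loss for a single preference pair.

    h is the gap between the policy-vs-reference log-ratios of the chosen
    and rejected texts; the loss pulls h toward the target 1 / (2 * tau).
    """
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return (h - 1.0 / (2.0 * tau)) ** 2
```

Unlike reward-model pipelines, this loss needs only log-probabilities from the trained policy and a frozen reference model, which is what makes it a "direct" method. In practice the inputs would be differentiable tensors produced by the LoRA-adapted model.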

Optimization Features

Infra Optimization

  • Experiments run on NVIDIA A100 80GB GPUs

Model Optimization

  • LoRA

System Optimization

  • Mix post-training autoregressive loss with PO at intervals to avoid degeneration
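
One simple way to picture this interleaving is a step schedule that switches to the autoregressive (post-training) loss at a fixed interval. The interval, the function name `loss_schedule`, and the labels `"po"` and `"sft"` are hypothetical; the source does not state the actual interval or mixing rule.

```python
def loss_schedule(num_steps, sft_every=4):
    """Return which loss to use at each step.

    Every `sft_every`-th step uses the autoregressive loss ("sft");
    all other steps use the preference-optimization loss ("po").
    """
    return [
        "sft" if (step + 1) % sft_every == 0 else "po"
        for step in range(num_steps)
    ]
```

The periodic autoregressive steps act as an anchor, keeping the model's basic generation ability from drifting while the preference loss reshapes its citation behavior.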

Training Optimization

  • Progressive statement-level preference optimization
  • Experience replay to reduce overfitting
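
Experience replay can be sketched as mixing a sample of earlier preference pairs into each new training batch. The source does not describe the replay mechanism in detail, so the buffer, the `replay_fraction` parameter, and the function name `mix_with_replay` are assumptions.

```python
import random


def mix_with_replay(new_pairs, replay_buffer, replay_fraction=0.25, seed=0):
    """Combine new preference pairs with a random sample of earlier ones.

    The number of replayed items is a fraction of the new batch size,
    capped by how many items the buffer actually holds.
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), int(len(new_pairs) * replay_fraction))
    batch = list(new_pairs) + rng.sample(list(replay_buffer), n_replay)
    rng.shuffle(batch)
    return batch
```

Revisiting earlier examples keeps later stages of a progressive schedule from overwriting what earlier stages learned, which matches the overfitting concern noted above.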

Inference Optimization

  • Low decoding temperature (0.01) for deterministic outputs
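
The effect of such a low temperature can be seen with a small softmax example. The logits are made up for illustration; only the temperature value 0.01 comes from the source.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.9, 0.5]  # hypothetical next-token logits
near_greedy = softmax_with_temperature(logits, 0.01)
default = softmax_with_temperature(logits, 1.0)
```

At temperature 0.01 nearly all probability mass lands on the top-scoring token even though the second logit is close, so sampling behaves almost like greedy decoding. At temperature 1.0 the top token receives less than half the mass.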

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data scope is narrow: the post-training data and the synthesized preference pairs come mainly from specific attribution datasets and Wikipedia.
  • Synthesized preference data might not cover all real-world hallucination modes or domains.
  • Method depends on retrieval quality; low-quality or truncated documents can cause fabrication.
  • Code and full datasets are not yet released, limiting immediate reproducibility.

When Not To Use

  • If you lack a reliable retriever or your document sources are low quality or proprietary.
  • When domain coverage differs strongly from Wikipedia-style sources used in training.
  • If you cannot afford the compute for fine-tuning ~13B models, even with LoRA.

Failure Modes

  • Model fabricates facts and attributes them to irrelevant documents when retrievals are poor.
  • Overfitting to deterministic synthesized preferences if experience replay is not used.
  • Incomplete answers due to omission errors when error-type coverage is insufficient.

Core Entities

Models

  • llama-2-13b-base
  • llama-2-13b-chat
  • selfrag_llama2_7b (critic)
  • gtr-t5-large (retriever)

Metrics

  • Citation Precision
  • Citation Recall
  • Citation F1
  • EM-R (exact-match recall for short answers)
  • Claim recall
  • Accuracy
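
As a rough illustration of how the three citation metrics relate, the sketch below computes precision, recall, and F1 for one statement from sets of document IDs. It assumes gold supporting-document labels are available; real evaluations may instead judge support with an entailment model, so treat this as a simplification. The function name `citation_prf` is hypothetical.

```python
def citation_prf(cited, supporting):
    """Set-based citation precision, recall, and F1 for one statement.

    cited:      document IDs the model cited
    supporting: document IDs that actually support the statement (assumed gold labels)
    """
    cited, supporting = set(cited), set(supporting)
    correct = len(cited & supporting)
    precision = correct / len(cited) if cited else 0.0
    recall = correct / len(supporting) if supporting else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so a model cannot score well by citing every document (high recall, low precision) or by citing almost nothing (high precision, low recall).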

Datasets

  • ASQA
  • StrategyQA
  • ELI5
  • EVIGSE
  • ExpertQA
  • HAGRID

Benchmarks

  • ASQA
  • StrategyQA
  • ELI5