Overview
The method shows reproducible gains on three public QA sets and uses standard tools (LoRA, gtr-t5-large). Results are promising but limited by dataset scope, reliance on retrieval quality, and synthesized pairs.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
APO reduces unsupported claims and improves citation accuracy by fine-tuning models on automatically generated preference pairs, so product teams can improve trustworthiness without a large human-labeling budget.
Summary TLDR
This paper treats citation-aware generation as a preference-learning problem and introduces APO (Automatic Preference Optimization). The authors (1) post-train an LLM on 6,330 curated attribution examples, (2) auto-synthesize 95,263 preference pairs covering several error types, and (3) apply a progressive, statement-level preference optimization (with experience replay) using LoRA. On ASQA, StrategyQA and ELI5, APO raises citation F1 and often improves answer correctness versus retrieval-only baselines like Self-RAG and AGREE.
Problem Statement
LLMs often produce believable but unsupported claims and incorrect citations. Existing fixes focus on retrieval or post-hoc linking, not on teaching the model to prefer well-attributed generations. Manual pairwise labels for preference learning are costly. The paper asks: can we automatically create preference data and directly fine-tune models so they generate correct answers with faithful citations?
Main Contribution
Formulate attributed text generation as a preference learning problem and introduce the APO framework.
Build a post-training attribution corpus of 6,330 curated examples from EVIGSE, ExpertQA, and HAGRID.
Key Findings
APO improves ASQA citation F1 from 63.5 to 71.2 after preference optimization.
APO raises ASQA exact-match recall on short answers (EM-R) to 40.5, outperforming Self-RAG (31.7) by 8.8 points.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASQA EM-R (exact-match recall of correct short answers) | 40.5 | Self-RAG 31.7 | +8.8 | ASQA test | APO EM-R 40.5 vs Self-RAG 31.7 | Table 2 |
| ASQA Citation F1 | 71.2 | Post-training 63.5 | +7.7 | ASQA test | APO citation F1 71.2; post-training baseline 63.5 | Table 2 |
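
Citation F1 in the table is the harmonic mean of citation precision and citation recall. The sketch below shows that arithmetic on toy data. It is a simplification under stated assumptions: the function name `citation_prf` is hypothetical, and support is judged by overlap with hand-labeled gold document sets, which may differ from the protocol the paper used to score citations.

```python
from typing import List, Set, Tuple


def citation_prf(
    predicted: List[Set[int]], gold: List[Set[int]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over per-statement citation sets.

    predicted[i] holds the document ids the model cited for statement i.
    gold[i] holds the document ids that actually support statement i
    (assumed to be available; this is an illustrative stand-in).
    """
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two statements, one spurious citation and one missed citation.
predicted = [{1, 2}, {3}]
gold = [{1}, {3, 4}]
precision, recall, f1 = citation_prf(predicted, gold)
```

Because F1 is a harmonic mean, a model that cites every retrieved document (high recall, low precision) or cites almost nothing (the reverse) scores poorly, which is why it is a reasonable single number to track before and after preference optimization.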
What To Try In 7 Days
Fine-tune a current LLM on a small curated attribution set (~6k examples) to teach citation format.
Auto-synthesize negative preference examples (fabrication/omission) around your retrievals and create pairwise data.
Apply a direct preference optimization (PO) method (e.g., IPO or progressive statement-level PO) with LoRA to limit compute, then measure citation F1 on a held-out QA set.
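The second and third steps above can be sketched in plain Python. This is a minimal illustration, not the paper's pipeline: the `PreferencePair` class, the `synthesize_pairs` helper, and the two corruption rules are hypothetical. Here "fabrication" is assumed to mean citing a retrieved document that does not support the statement, and "omission" is assumed to mean dropping the citation. The `ipo_loss` function follows the standard IPO objective, a squared distance between the policy-versus-reference log-ratio margin and the target `1 / (2 * tau)`.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # well-attributed statement
    rejected: str    # corrupted variant
    error_type: str  # which corruption produced `rejected`


def synthesize_pairs(
    question: str,
    statement: str,
    cited_id: int,
    retrieved_ids: List[int],
    seed: int = 0,
) -> List[PreferencePair]:
    """Build one 'fabrication' and one 'omission' negative for a statement."""
    rng = random.Random(seed)
    chosen = f"{statement} [{cited_id}]"
    # Assumed fabrication rule: swap in a different retrieved document id.
    wrong_id = rng.choice([d for d in retrieved_ids if d != cited_id])
    fabricated = f"{statement} [{wrong_id}]"
    # Assumed omission rule: keep the statement but drop its citation.
    omitted = statement
    return [
        PreferencePair(question, chosen, fabricated, "fabrication"),
        PreferencePair(question, chosen, omitted, "omission"),
    ]


def ipo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    tau: float = 0.1,
) -> float:
    """IPO loss for one pair, given sequence log-probs under policy and reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return (margin - 1.0 / (2.0 * tau)) ** 2


pairs = synthesize_pairs(
    "What is the capital of France?",
    "Paris is the capital of France.",
    cited_id=2,
    retrieved_ids=[1, 2, 3],
)
```

In a real run the log-probabilities would come from the LoRA-adapted policy and a frozen reference model. Unlike a logistic preference loss, the squared form penalizes margins that overshoot the target as well as those that fall short, which is one reason IPO is sometimes preferred when the preference labels are deterministic.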
Risks & Boundaries
Limitations
Data scope is narrow: the post-training examples and the synthesized preference pairs come mainly from specific attribution datasets and Wikipedia.
Synthesized preference data might not cover all real-world hallucination modes or domains.
When Not To Use
If you lack a reliable retriever or your document sources are low quality or proprietary.
When domain coverage differs strongly from Wikipedia-style sources used in training.
Failure Modes
Model fabricates facts and attributes them to irrelevant documents when retrievals are poor.
Overfitting to deterministic synthesized preferences if experience replay is not used.
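The overfitting risk above is the kind of problem experience replay is meant to address. A minimal sketch of one common way to implement it follows: each batch in the current optimization stage reserves a fixed share for pairs sampled from earlier stages. The function name `build_replay_batch` and the `replay_ratio` default are assumptions for illustration, and the paper's actual replay schedule may differ.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_replay_batch(
    current_pairs: Sequence[T],
    replay_buffer: Sequence[T],
    batch_size: int,
    replay_ratio: float = 0.25,
    seed: int = 0,
) -> List[T]:
    """Mix current-stage pairs with pairs replayed from earlier stages."""
    rng = random.Random(seed)
    # Never request more replayed items than the buffer holds.
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    n_current = min(batch_size - n_replay, len(current_pairs))
    batch = rng.sample(list(replay_buffer), n_replay)
    batch += rng.sample(list(current_pairs), n_current)
    rng.shuffle(batch)  # avoid a fixed old-then-new ordering inside the batch
    return batch


# Toy usage: 8-item batch, a quarter of it replayed from an earlier stage.
current = [("stage2", i) for i in range(20)]
earlier = [("stage1", i) for i in range(20)]
batch = build_replay_batch(current, earlier, batch_size=8)
```

With an empty buffer (the first stage) the function falls back to a batch drawn entirely from the current stage, so the same training loop can be used throughout.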

