Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
APO reduces unsupported claims and improves citation accuracy by fine-tuning models with automated preference pairs, letting product teams boost trustworthiness without huge human-label budgets.
Summary TLDR
This paper treats citation-aware generation as a preference-learning problem and introduces APO (Automatic Preference Optimization). The authors (1) post-train an LLM on 6,330 curated attribution examples, (2) auto-synthesize 95,263 preference pairs covering several error types, and (3) apply a progressive, statement-level preference optimization (with experience replay) using LoRA. On ASQA, StrategyQA and ELI5, APO raises citation F1 and often improves answer correctness versus baselines such as Self-RAG and AGREE.
Problem Statement
LLMs often produce believable but unsupported claims and incorrect citations. Existing fixes focus on retrieval or post-hoc linking, not on teaching the model to prefer well-attributed generations. Manual pairwise labels for preference learning are costly. The paper asks: can we automatically create preference data and directly fine-tune models so they generate correct answers with faithful citations?
Main Contribution
Formulate attributed text generation as a preference learning problem and introduce the APO framework.
Build a post-training attribution corpus of 6,330 curated examples from EVIGSE, ExpertQA, and HAGRID.
Propose an automated pipeline that synthesizes 95,263 preference pairs covering predefined error types (fabrication, mistaken synthesis, omission, irrelevant-but-supported).
Introduce progressive statement-level preference optimization with experience replay to strengthen fine-grained preferences without explicit reward models.
Show consistent gains on ASQA, StrategyQA and ELI5 in citation F1 and often in answer correctness compared to baselines.
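The pair-synthesis idea in the list above can be illustrated with a toy sketch. The corruption rules below are hypothetical stand-ins for two of the listed error types (fabrication and omission); this summary does not describe the authors' actual pipeline, and every name here (`Statement`, `corrupt_omission`, `corrupt_fabrication`, `make_pairs`) is illustrative only.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Statement:
    text: str
    citations: Tuple[int, ...]  # indices into the retrieved document list


def corrupt_omission(s: Statement) -> Statement:
    """Hypothetical 'omission' negative: drop the last supporting citation."""
    return replace(s, citations=s.citations[:-1])


def corrupt_fabrication(s: Statement, unsupported_text: str) -> Statement:
    """Hypothetical 'fabrication' negative: keep the citations but swap in
    a claim that the cited documents do not support."""
    return replace(s, text=unsupported_text)


def make_pairs(good: Statement, unsupported_text: str) -> List[dict]:
    """Return chosen/rejected pairs, each labelled with its error type."""
    pairs = []
    if good.citations:  # omission only makes sense if something is cited
        pairs.append({"error": "omission", "chosen": good,
                      "rejected": corrupt_omission(good)})
    pairs.append({"error": "fabrication", "chosen": good,
                  "rejected": corrupt_fabrication(good, unsupported_text)})
    return pairs


good = Statement("The Eiffel Tower was completed in 1889.", (1,))
pairs = make_pairs(good, "The Eiffel Tower was completed in 1925.")
for p in pairs:
    print(p["error"], "->", p["rejected"])
```

Each pair keeps the well-attributed statement as `chosen` and a single controlled corruption as `rejected`, so the preference signal isolates one error type at a time.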
Key Findings
APO improves ASQA citation F1 from 63.5 to 71.2 after preference optimization.
APO raises ASQA exact-match recall on short answers (EM-R) to 40.5, 8.8 points above Self-RAG.
Scale of synthesized preference data: 95,263 automatically generated pairs are used for preference optimization (PO).
Error breakdown in human review: fabrication accounts for 48.4% of attribution errors.
Results
ASQA EM-R (correct short answer recall)
ASQA Citation F1
Accuracy
ELI5 Claim Recall
Who Should Care
What To Try In 7 Days
Fine-tune a current LLM on a small curated attribution set (~6k examples) to teach citation format.
Auto-synthesize negative preference examples (fabrication/omission) around your retrievals and create pairwise data.
Apply a direct preference-optimization (PO) method (e.g., IPO or progressive statement-level PO) with LoRA to limit compute, then measure citation F1 on a held-out QA set.
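As a rough illustration of what a direct preference objective computes on one chosen/rejected pair, the sketch below implements the standard DPO and IPO losses from summed log-probabilities. It is not the paper's progressive statement-level objective, which this summary does not spell out; the function names, the `beta` value, and the example log-probabilities are assumptions.

```python
import math


def log_ratio_margin(pol_chosen: float, pol_rejected: float,
                     ref_chosen: float, ref_rejected: float) -> float:
    """h = [log pi(y_w) - log pi_ref(y_w)] - [log pi(y_l) - log pi_ref(y_l)].

    Inputs are summed token log-probabilities of the chosen (y_w) and
    rejected (y_l) outputs under the policy and the frozen reference model.
    """
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)


def dpo_loss(h: float, beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * h), written as log(1 + exp(-beta * h))."""
    return math.log1p(math.exp(-beta * h))


def ipo_loss(h: float, beta: float = 0.1) -> float:
    """IPO: squared distance of the margin from the target 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2


# Made-up log-probabilities for one preference pair.
h = log_ratio_margin(pol_chosen=-12.0, pol_rejected=-15.0,
                     ref_chosen=-13.0, ref_rejected=-13.5)
print("margin:", h)
print("DPO loss:", dpo_loss(h))
print("IPO loss:", ipo_loss(h))
```

Neither loss needs a separate reward model: the preference signal comes entirely from how far the policy has moved from the reference on the chosen output relative to the rejected one.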
Optimization Features
Infra Optimization
- Experiments run on NVIDIA A100 80GB GPUs
Model Optimization
- LoRA
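To make the compute saving concrete, the sketch below counts trainable parameters for a single LoRA-adapted weight matrix. The hidden size of 5120 matches Llama-2-13B; the rank of 8 is an assumed value for illustration and is not taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in); W itself stays frozen."""
    return rank * (d_in + d_out)


hidden = 5120  # hidden size of Llama-2-13B
rank = 8       # assumed rank, not taken from the paper
full = hidden * hidden                       # parameters in one square projection
lora = lora_trainable_params(hidden, hidden, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted projection.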
System Optimization
- Mix post-training autoregressive loss with PO at intervals to avoid degeneration
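One simple way to interleave the two objectives is a fixed step schedule, sketched below. The interval and the every-n-th-step rule are assumptions; the summary only says the autoregressive loss is mixed in "at intervals". `loss_schedule` and `lm_every` are hypothetical names.

```python
from typing import Iterator


def loss_schedule(total_steps: int, lm_every: int) -> Iterator[str]:
    """Yield which objective to use at each training step.

    Every `lm_every`-th step uses the autoregressive (post-training) loss;
    all other steps use the preference-optimization loss.
    """
    for step in range(1, total_steps + 1):
        yield "lm" if step % lm_every == 0 else "po"


schedule = list(loss_schedule(total_steps=8, lm_every=4))
print(schedule)
```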
Training Optimization
- Progressive statement-level preference optimization
- Experience replay to reduce overfitting
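A minimal sketch of experience replay in a staged setup follows: each batch mixes pairs from the current stage with a fraction drawn from earlier stages. The mixing fraction, the sampling scheme, and the names (`replay_batch`, `replay_frac`) are assumptions, not details from the paper.

```python
import random
from typing import List, Sequence


def replay_batch(current: Sequence[str], replay_buffer: Sequence[str],
                 batch_size: int, replay_frac: float,
                 rng: random.Random) -> List[str]:
    """Build one batch from current-stage items plus replayed earlier items."""
    # Cap the replay share by what the buffer actually holds.
    n_replay = min(int(batch_size * replay_frac), len(replay_buffer))
    batch = rng.sample(list(replay_buffer), n_replay)              # without replacement
    batch += rng.choices(list(current), k=batch_size - n_replay)   # with replacement
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
old_pairs = [f"old-{i}" for i in range(10)]   # pairs seen in earlier stages
new_pairs = [f"new-{i}" for i in range(10)]   # pairs for the current stage
batch = replay_batch(new_pairs, old_pairs, batch_size=8, replay_frac=0.25, rng=rng)
print(batch)
```

Revisiting earlier pairs keeps the model from fitting only the most recent stage's synthesized preferences.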
Inference Optimization
- Low decoding temperature (0.01) for deterministic outputs
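The effect of such a low temperature can be seen in a small softmax sketch: dividing the logits by 0.01 pushes almost all probability mass onto the top token, so sampling behaves like greedy decoding. The logits below are made up for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits scaled by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.3]  # made-up next-token logits
probs_t1 = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.01)
print("T=1.0 :", probs_t1)
print("T=0.01:", probs_low)
```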
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data scope is narrow: the post-training corpus and the synthesized preference pairs come mainly from specific attribution datasets and Wikipedia.
- Synthesized preference data might not cover all real-world hallucination modes or domains.
- Method depends on retrieval quality; low-quality or truncated documents can cause fabrication.
- Code and full datasets are not yet released, limiting immediate reproducibility.
When Not To Use
- If you lack a reliable retriever or your document sources are low quality or proprietary.
- When domain coverage differs strongly from Wikipedia-style sources used in training.
- If you cannot afford the compute for fine-tuning even with LoRA on ~13B models.
Failure Modes
- Model fabricates facts and attributes them to irrelevant documents when retrievals are poor.
- Overfitting to deterministic synthesized preferences if experience replay is not used.
- Incomplete answers due to omission errors when error-type coverage is insufficient.
Core Entities
Models
- llama-2-13b-base
- llama-2-13b-chat
- selfrag_llama2_7b (critic)
- gtr-t5-large (retriever)
Metrics
- Citation Precision
- Citation Recall
- Citation F1
- EM-R (exact-match recall of short answers)
- Claim Recall
- Accuracy
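Two of the metrics above reduce to simple formulas once the underlying judgments are available. The sketch below computes F1 as the harmonic mean of precision and recall, and a simplified EM-R as the fraction of gold short answers found verbatim in the output. The case-insensitive substring match is an assumption; the benchmarks' exact normalization and the way citation precision and recall are judged are not described in this summary.

```python
from typing import List


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def em_recall(output: str, gold_short_answers: List[str]) -> float:
    """Fraction of gold short answers that appear (case-insensitively)
    as substrings of the generated output."""
    if not gold_short_answers:
        return 0.0
    text = output.lower()
    hits = sum(1 for ans in gold_short_answers if ans.lower() in text)
    return hits / len(gold_short_answers)


print(f1(0.5, 1.0))
print(em_recall("Paris and Lyon are cities in France.", ["paris", "Berlin"]))
```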
Datasets
- ASQA
- StrategyQA
- ELI5
- EVIGSE
- ExpertQA
- HAGRID
Benchmarks
- ASQA
- StrategyQA
- ELI5

