CorpusLM: unify generative retrieval and continuous RAG into one model

February 2, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent improvements across 11 KILT datasets and reports concrete efficiency gains; ablations identify which parts matter. Evidence is empirical on Wikipedia+KILT; expect extra integration work for non-Wikipedia or live corpora.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 65%

Authors

Xiaoxi Li, Zhicheng Dou, Yujia Zhou, Fangchao Liu

Links

Abstract / PDF / Data

Why It Matters For Business

CorpusLM can replace a heavy index+reader stack with a single model that reduces storage and latency while improving factual retrieval and downstream answers on wiki-like corpora, lowering hosting and inference costs for knowledge-driven products.

Who Should Care

Summary TLDR

CorpusLM is a single autoregressive model that combines generative retrieval (generate document IDs), closed-book generation, and a continuous retrieval-augmented generation (RAG) flow that decodes DocIDs, extracts fine-grained references, then outputs the answer. Key training ideas are ranking-oriented DocID list generation, unsupervised DocID understanding tasks, and a continuous DocIDs→references→answer decoding with noise sampling. Evaluated on KILT (11 datasets), CorpusLM (T5 and Llama2 backbones) improves retrieval R-Precision and downstream scores versus dense, sparse, and prior generative retrievers, while cutting storage and latency versus index-based RAG.
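The continuous flow described above (DocIDs, then references, then the answer, all in one decoded sequence) can be illustrated with a small parser. This is a minimal sketch, not the paper's serialization: the `[DOCIDS]`, `[REFS]`, and `[ANSWER]` markers, the `;` separator, and the names `ContinuousOutput` and `parse_continuous_output` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ContinuousOutput:
    doc_ids: List[str]      # stage 1: generated document identifiers
    references: List[str]   # stage 2: fine-grained reference snippets
    answer: str             # stage 3: final answer


# Hypothetical section markers; the source does not specify the exact layout.
_PATTERN = re.compile(
    r"\[DOCIDS\](?P<docids>.*?)\[REFS\](?P<refs>.*?)\[ANSWER\](?P<answer>.*)",
    re.DOTALL,
)


def _split_items(section: str) -> List[str]:
    """Split a ';'-separated section into stripped, non-empty items."""
    return [item.strip() for item in section.split(";") if item.strip()]


def parse_continuous_output(text: str) -> ContinuousOutput:
    """Parse one decoded sequence into its three stages, in order."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("text does not follow the assumed DocIDs/refs/answer layout")
    return ContinuousOutput(
        doc_ids=_split_items(match["docids"]),
        references=_split_items(match["refs"]),
        answer=match["answer"].strip(),
    )
```

Keeping the three stages in one sequence is what lets the answer tokens condition on the references without a second model call; the parser only makes that ordering explicit for logging or evaluation.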

Problem Statement

Large LMs hallucinate on knowledge-intensive tasks. Traditional RAG uses a separate index and reader, costing memory and blocking end-to-end training. Generative retrieval (produce DocIDs) exists, but prior work rarely unifies retrieval and answer generation in one greedy decoding process or trains the model to rank DocIDs and use them as intermediate references.

Main Contribution

CorpusLM: a unified autoregressive model that performs generative retrieval, closed-book generation, and continuous RAG in one greedy decoding pass.

Ranking-oriented DocID list generation: train on ranked lists of DocIDs and use dynamic prefix-tree constraints to generate valid non-repetitive DocID lists.
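The prefix-tree constraint mentioned above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: DocIDs are whitespace-tokenized titles, `END` is a hypothetical end-of-DocID marker, and `score` stands in for the model's next-token scores. The tree is "dynamic" in the sense that each finished DocID is removed, so it cannot be generated twice.

```python
from typing import Callable, Dict, List, Sequence

END = "<eod>"  # hypothetical end-of-DocID marker


class PrefixTree:
    """Trie over DocID token sequences; every path ends with END."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, tokens: Sequence[str]) -> None:
        node = self.root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})

    def remove(self, tokens: Sequence[str]) -> None:
        """Delete one DocID and prune branches that no longer lead anywhere."""
        path = []
        node = self.root
        for tok in list(tokens) + [END]:
            if tok not in node:
                return  # DocID not present; nothing to remove
            path.append((node, tok))
            node = node[tok]
        for parent, tok in reversed(path):
            if parent[tok]:  # child still has other continuations
                break
            del parent[tok]

    def allowed(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep the prefix on a path to some valid DocID."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return sorted(node)


def generate_docid_list(
    tree: PrefixTree,
    score: Callable[[List[str], str], float],
    k: int,
) -> List[List[str]]:
    """Greedily decode up to k distinct valid DocIDs (mutates the tree)."""
    results: List[List[str]] = []
    for _ in range(k):
        prefix: List[str] = []
        while True:
            options = tree.allowed(prefix)
            if not options:
                return results  # no DocIDs left to generate
            best = max(options, key=lambda tok: score(prefix, tok))
            if best == END:
                break
            prefix.append(best)
        results.append(prefix)
        tree.remove(prefix)  # dynamic update: forbid repeating this DocID
    return results
```

Because decoding only ever picks from `allowed(prefix)`, every emitted DocID exists in the corpus, and removal after each DocID keeps the list free of repeats. With a real model, the same idea is usually applied by masking the logits of disallowed tokens at each step.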

Key Findings

CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).

Numbers: FEVER R-Precision: CorpusLM (T5) 75.64 vs MT-DPR 64.05 (+11.59 pp)

Practical Use: If you need higher document recall for fact-checking, switching to CorpusLM-style generative retrieval can raise top-ranked retrieval quality on similar corpora.

Evidence Ref: Table 2

Continuous RAG with CorpusLM yields higher downstream accuracy on KILT FEVER than DPR+BART.

Numbers: RAG accuracy (FEVER): CorpusLM (Llama2) 90.22 vs DPR+BART 88.11 (+2.11 pp)

Practical Use: Using a unified model that decodes DocIDs then references can give small but consistent downstream gains over separate retriever+reader pipelines.

Evidence Ref: Table 3
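For readers comparing their own retriever against these numbers, R-Precision (the retrieval metric in the first finding) is the fraction of the top R ranked items that are relevant, where R is the number of relevant items for the query. The sketch below shows the standard definition and the percentage-point delta arithmetic; the function names and the toy inputs are illustrative and do not come from the paper's evaluation code.

```python
from typing import Iterable, List, Sequence


def r_precision(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Fraction of the top-R ranked DocIDs that are relevant, R = |relevant|."""
    relevant_set = set(relevant)
    r = len(relevant_set)
    if r == 0:
        return 0.0  # convention chosen here for queries with no gold documents
    return len(set(ranked[:r]) & relevant_set) / r


def mean_r_precision(
    runs: Sequence[Sequence[str]],
    golds: Sequence[Iterable[str]],
) -> float:
    """Average R-Precision over queries, reported as a percentage."""
    scores: List[float] = [r_precision(run, gold) for run, gold in zip(runs, golds)]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


def delta_pp(model_score: float, baseline_score: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(model_score - baseline_score, 2)
```

Applied to the figures quoted above, `delta_pp(75.64, 64.05)` reproduces the +11.59 pp retrieval gain and `delta_pp(90.22, 88.11)` the +2.11 pp accuracy gain.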

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Retrieval R-Precision (FEVER) | 75.64% | MT-DPR 64.05% | +11.59 pp | KILT dev (FEVER) | Table 2 shows CorpusLM (T5) 75.64 vs MT-DPR 64.05 | Table 2
Accuracy | 90.22% | DPR+BART 88.11% | +2.11 pp | KILT dev (FEVER) RAG setting | Table 3 shows CorpusLM (Llama2) 90.22 vs DPR+BART 88.11 | Table 3

What To Try In 7 Days

Run a small pilot: finetune CorpusLM on your FAQ/KB using its DocID list format and compare top-k retrieval quality to your current retriever.

Implement the continuous decode flow: generate DocIDs, short references, then answers to see if answers become more grounded.

Add DocID understanding tasks (generate summary from DocID) to your finetuning mix and measure retrieval ranking gains.
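A pilot along these lines needs training pairs for at least two of the tasks: query to ranked DocID list, and DocID to summary. The sketch below shows one way such pairs might be assembled into a single finetuning mix. The prompt wording, the `;` separator, and the names `Example`, `retrieval_example`, `docid_understanding_example`, and `build_training_mix` are all assumptions for illustration; the source does not give the exact templates.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    task: str    # which task this pair belongs to
    source: str  # model input
    target: str  # expected model output


def retrieval_example(query: str, ranked_doc_ids: Sequence[str]) -> Example:
    """Query -> ranked DocID list, most relevant first (hypothetical template)."""
    return Example(
        task="retrieval",
        source=f"Retrieve documents for: {query}",
        target="; ".join(ranked_doc_ids),
    )


def docid_understanding_example(doc_id: str, summary: str) -> Example:
    """DocID -> summary, so the model learns what each identifier refers to."""
    return Example(
        task="docid_understanding",
        source=f"Summarize the document with identifier: {doc_id}",
        target=summary,
    )


def build_training_mix(groups: Sequence[Sequence[Example]], seed: int = 0) -> List[Example]:
    """Concatenate per-task example lists and shuffle them reproducibly."""
    mix = [example for group in groups for example in group]
    random.Random(seed).shuffle(mix)
    return mix
```

Tagging each pair with its task makes it easy to drop the DocID understanding examples from the mix and measure how retrieval ranking changes, which is the comparison the third step asks for.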

Agent Features

Memory
DocID-based retrieval (generative)

Optimization Features

Token Efficiency
decode short references before answers to reduce irrelevant context
Infra Optimization
smaller model footprint avoids large dense index storage
Model Optimization
LoRA
System Optimization
DeepSpeed for Llama2 finetuning
Training Optimization
multi-task finetuning across retrieval, closed-book, RAG and DocID tasks
noise sampling during reference training
Inference Optimization
dynamic constrained greedy decoding for DocIDs
continuous single-pass decoding to avoid multi-round I/O
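The summary names noise sampling during reference training without describing the procedure. One plausible reading, stated here as an assumption, is that some gold DocIDs in the training prefix are swapped for irrelevant ones so the model learns to extract references even when its own retrieval is imperfect. The sketch below implements only that reading; `sample_noisy_docids` and `noise_prob` are hypothetical names.

```python
import random
from typing import List, Sequence


def sample_noisy_docids(
    gold: Sequence[str],
    corpus_ids: Sequence[str],
    noise_prob: float,
    rng: random.Random,
) -> List[str]:
    """Replace each gold DocID with a random non-gold one with prob. noise_prob.

    Assumed interpretation of "noise sampling"; not taken from the paper.
    The output has the same length as `gold` and never repeats a noise DocID.
    """
    gold_set = set(gold)
    pool = [doc_id for doc_id in corpus_ids if doc_id not in gold_set]
    noisy: List[str] = []
    for doc_id in gold:
        if pool and rng.random() < noise_prob:
            # Draw a distractor and remove it so it cannot be drawn again.
            noisy.append(pool.pop(rng.randrange(len(pool))))
        else:
            noisy.append(doc_id)
    return noisy
```

Passing an explicit `random.Random` keeps the sampling reproducible across runs, which helps when comparing ablations with and without noise.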

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

KILT benchmark (used for evaluation)
English Wikipedia dump 2019-08-01 (used as corpus)

Risks & Boundaries

Limitations

Evaluations are limited to KILT tasks and a Wikipedia corpus; cross-domain or live-web performance is untested.

Method relies on pre-defined DocIDs and a prefix-tree constraint; custom DocID design may be needed per corpus.

When Not To Use

When your knowledge source is highly dynamic (live web) and DocIDs cannot be pre-built.

When you depend on standard dense-index features (ANN search with external indexes) or on existing search infrastructure that you cannot replace.

Failure Modes

Generated DocIDs may miss relevant documents if DocID design or training data lacks coverage.

If retrieved passages are noisy or missing, the model can still hallucinate despite reference decoding.

Core Entities

Models

CorpusLM, T5-Base, Llama2-7B

Metrics

R-Precision, Accuracy, Exact Match, ROUGE-L, F1, has_answer

Datasets

KILT, Wikipedia (2019-08-01 dump)

Benchmarks

KILT

Context Entities

Models

BM25, DPR, MT-DPR, RAG, E5, SimLM, SEAL, CorpusBrain, FID, BART

Datasets

FEVER, AIDA CoNLL-YAGO (AY2), WNED-WIKI (WnWi), WNED-CWEB (WnCw), T-REx, Zero Shot RE (zsRE), Natural Questions (NQ), HotpotQA (HoPo), TriviaQA (TQA), ELI5, Wizard of Wikipedia (WoW)