CorpusLM: unify generative retrieval and continuous RAG into one model

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent improvements across 11 KILT datasets and reports concrete efficiency gains; ablations identify which parts matter. Evidence is empirical on Wikipedia+KILT; expect extra integration work for non-Wikipedia or live corpora.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 65%

Authors

Xiaoxi Li, Zhicheng Dou, Yujia Zhou, Fangchao Liu

Why It Matters For Business

CorpusLM can replace a heavy index+reader stack with a single model that reduces storage and latency while improving factual retrieval and downstream answers on wiki-like corpora, lowering hosting and inference costs for knowledge-driven products.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Engineering Lead

Summary TLDR

CorpusLM is a single autoregressive model that combines generative retrieval (generate document IDs), closed-book generation, and a continuous retrieval-augmented generation (RAG) flow that decodes DocIDs, extracts fine-grained references, then outputs the answer. Key training ideas are ranking-oriented DocID list generation, unsupervised DocID understanding tasks, and a continuous DocIDs→references→answer decoding with noise sampling. Evaluated on KILT (11 datasets), CorpusLM (T5 and Llama2 backbones) improves retrieval R-Precision and downstream scores versus dense, sparse, and prior generative retrievers, while cutting storage and latency versus index-based RAG.
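The continuous flow described above (DocIDs, then references, then the answer, all in one growing sequence) can be pictured with a minimal sketch. It is not the authors' implementation: `decode` is a hypothetical stand-in for the model continuing the same sequence, and the stage labels, the `;` separator, and `toy_decode` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagOutput:
    doc_ids: List[str]
    references: List[str]
    answer: str


def continuous_rag(query: str, decode: Callable[[str, str], str]) -> RagOutput:
    """Run the three stages over one growing context.

    `decode(context, stage)` is a hypothetical callable that returns the
    text the model would generate next for the given stage.
    """
    context = f"query: {query}"

    # Stage 1: generate a ranked list of DocIDs.
    doc_ids_text = decode(context, "docids")
    context += f" docids: {doc_ids_text}"

    # Stage 2: generate short references, conditioned on the DocIDs.
    refs_text = decode(context, "references")
    context += f" references: {refs_text}"

    # Stage 3: generate the answer, conditioned on everything so far.
    answer = decode(context, "answer")

    return RagOutput(
        doc_ids=[d.strip() for d in doc_ids_text.split(";") if d.strip()],
        references=[r.strip() for r in refs_text.split(";") if r.strip()],
        answer=answer.strip(),
    )


def toy_decode(context: str, stage: str) -> str:
    """Illustrative stub; a real system would call the finetuned model."""
    canned = {
        "docids": "Paris; France",
        "references": "Paris is the capital of France.",
        "answer": "Paris",
    }
    return canned[stage]
```

Calling `continuous_rag("What is the capital of France?", toy_decode)` returns the DocIDs, the reference, and the answer as separate fields, which makes it easy to log each stage when checking whether answers are grounded.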

Problem Statement

Large LMs hallucinate on knowledge-intensive tasks. Traditional RAG uses a separate index and reader, costing memory and blocking end-to-end training. Generative retrieval (produce DocIDs) exists, but prior work rarely unifies retrieval and answer generation in one greedy decoding process or trains the model to rank DocIDs and use them as intermediate references.

Main Contribution

CorpusLM: a unified autoregressive model that performs generative retrieval, closed-book generation, and continuous RAG in one greedy decoding pass.

Ranking-oriented DocID list generation: train on ranked lists of DocIDs and use dynamic prefix-tree constraints to generate valid non-repetitive DocID lists.
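The prefix-tree constraint can be pictured with a minimal sketch. It assumes DocIDs are token sequences stored in a trie, that a hypothetical `score` callable stands in for the model's next-token scores, and that `END` is an illustrative end-of-DocID marker; none of these names come from the paper. After each DocID is completed it is removed from the trie, so the same DocID cannot be generated twice.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-DocID marker


class DocIdTrie:
    """Prefix tree over DocID token sequences."""

    def __init__(self, doc_ids: Sequence[Sequence[str]]):
        self.root: Dict = {}
        for tokens in doc_ids:
            node = self.root
            for tok in list(tokens) + [END]:
                node = node.setdefault(tok, {})

    def allowed(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep `prefix` on a path to some remaining DocID."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return sorted(node)

    def remove(self, tokens: Sequence[str]) -> None:
        """Delete one DocID and prune branches that lead nowhere."""
        seq = list(tokens) + [END]
        path = [self.root]
        for tok in seq:
            if tok not in path[-1]:
                return  # DocID not present
            path.append(path[-1][tok])
        for parent, tok in zip(reversed(path[:-1]), reversed(seq)):
            del parent[tok]
            if parent:  # other DocIDs still share this prefix
                break


def decode_docid_list(
    trie: DocIdTrie,
    score: Callable[[List[List[str]], List[str], str], float],
    k: int,
) -> List[List[str]]:
    """Greedily decode up to k distinct DocIDs (mutates `trie`)."""
    results: List[List[str]] = []
    for _ in range(k):
        prefix: List[str] = []
        while True:
            options = trie.allowed(prefix)
            if not options:
                return results  # no DocIDs left
            tok = max(options, key=lambda t: score(results, prefix, t))
            if tok == END:
                break
            prefix.append(tok)
        results.append(prefix)
        trie.remove(prefix)  # dynamic constraint: no repeats
    return results
```

With a real model, `score` would return the logit of token `t` given the query, the DocIDs generated so far, and the current prefix. The pruning step in `remove` is what keeps every allowed token on a path to a DocID that has not been emitted yet.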

Key Findings

CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).

Numbers: FEVER R-Precision: CorpusLM (T5) 75.64 vs MT-DPR 64.05 (Δ+11.59 pp)

Practical Use: If you need higher document recall for fact-checking, switching to CorpusLM-style generative retrieval can raise top-ranked retrieval quality on similar corpora.

Evidence Ref: Table 2

Continuous RAG with CorpusLM yields higher downstream accuracy on KILT FEVER than DPR+BART.

Numbers: RAG accuracy (FEVER): CorpusLM (Llama2) 90.22 vs DPR+BART 88.11 (Δ+2.11 pp)

Practical Use: Using a unified model that decodes DocIDs then references can give small but consistent downstream gains over separate retriever+reader pipelines.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Retrieval R-Precision (FEVER)	75.64%	MT-DPR 64.05%	+11.59 pp	KILT dev (FEVER)	Table 2 shows CorpusLM (T5) 75.64 vs MT-DPR 64.05	Table 2
Accuracy	90.22%	DPR+BART 88.11%	+2.11 pp	KILT dev (FEVER) RAG setting	Table 3 shows CorpusLM (Llama2) 90.22 vs DPR+BART 88.11	Table 3
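For readers reproducing these comparisons on their own data, here is a minimal sketch of the two calculations behind the table. It uses the standard definition of R-Precision (the fraction of the top R retrieved items that are relevant, where R is the number of relevant items). The function names are illustrative, and the benchmark's official scorer may differ in details such as how document identifiers are matched.

```python
from typing import Sequence, Set


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of the top-R ranked items that are relevant, R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:r] if doc_id in relevant)
    return hits / r


def delta_pp(system: float, baseline: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(system - baseline, 2)


# Deltas reported in the table above.
fever_retrieval_delta = delta_pp(75.64, 64.05)  # 11.59
fever_rag_delta = delta_pp(90.22, 88.11)        # 2.11
```

Averaging `r_precision` over all queries in a dev set gives a single score that can be compared directly between a current retriever and a generative one.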

What To Try In 7 Days

Run a small pilot: finetune CorpusLM on your FAQ/KB using its DocID list format and compare top-k retrieval quality to your current retriever.

Implement the continuous decode flow: generate DocIDs, short references, then answers to see if answers become more grounded.

Add DocID understanding tasks (generate summary from DocID) to your finetuning mix and measure retrieval ranking gains.
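As a starting point for the last step, the sketch below builds "generate summary from DocID" training pairs and mixes them into an existing set of retrieval examples. The prompt wording, the use of leading text as a pseudo-summary, and the mixing ratio are all assumptions for illustration, not the paper's recipe.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (input text, target text)


def docid_understanding_pairs(
    corpus: Dict[str, str], max_chars: int = 200
) -> List[Example]:
    """Build DocID -> summary pairs; leading text stands in for a summary."""
    pairs = []
    for doc_id, text in corpus.items():
        summary = text[:max_chars].strip()
        prompt = f"Summarize the document with DocID: {doc_id}"
        pairs.append((prompt, summary))
    return pairs


def mix_tasks(
    retrieval: List[Example],
    understanding: List[Example],
    ratio: float = 0.5,
    seed: int = 0,
) -> List[Example]:
    """Add `ratio` understanding examples per retrieval example, then shuffle."""
    rng = random.Random(seed)
    n_extra = min(int(len(retrieval) * ratio), len(understanding))
    mixed = list(retrieval) + rng.sample(understanding, n_extra)
    rng.shuffle(mixed)
    return mixed
```

Training once with `ratio=0.0` and once with a non-zero ratio, then comparing retrieval ranking on the same dev queries, gives a direct read on whether the auxiliary task helps on your corpus.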

Agent Features

Memory

DocID-based retrieval (generative)

Optimization Features

Token Efficiency

decode short references before answers to reduce irrelevant context

Infra Optimization

smaller model footprint avoids large dense index storage

Model Optimization

LoRA

System Optimization

DeepSpeed for Llama2 finetuning

Training Optimization

multi-task finetuning across retrieval, closed-book, RAG and DocID tasks

noise sampling during reference training
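One plausible reading of noise sampling is that irrelevant DocIDs are mixed into the DocID list during training, so the model learns to extract references even when retrieval is imperfect. The sketch below illustrates that reading only; the function name, the distractor pool, and the random insertion strategy are assumptions, not the paper's procedure.

```python
import random
from typing import List, Sequence


def add_noise_docids(
    gold: Sequence[str],
    pool: Sequence[str],
    n_noise: int,
    seed: int = 0,
) -> List[str]:
    """Insert up to n_noise distractor DocIDs at random positions.

    The relative order of the gold DocIDs is preserved.
    """
    rng = random.Random(seed)
    gold_set = set(gold)
    candidates = [d for d in pool if d not in gold_set]
    noise = rng.sample(candidates, min(n_noise, len(candidates)))

    noisy = list(gold)
    for doc_id in noise:
        # randint is inclusive, so len(noisy) means "append at the end".
        noisy.insert(rng.randint(0, len(noisy)), doc_id)
    return noisy
```

The training target (references and answer) stays tied to the gold DocIDs, while the input list contains the distractors as well.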

Inference Optimization

dynamic constrained greedy decoding for DocIDs

continuous single-pass decoding to avoid multi-round I/O

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

KILT benchmark (used for evaluation)

English Wikipedia dump 2019-08-01 (used as corpus)

Risks & Boundaries

Limitations

Evaluations are limited to KILT tasks and a Wikipedia corpus; cross-domain or live-web performance is untested.

Method relies on pre-defined DocIDs and a prefix-tree constraint; custom DocID design may be needed per corpus.

When Not To Use

When your knowledge source is highly dynamic (live web) and DocIDs cannot be pre-built.

If you require standard dense-index features (ANN search with external indexes) or existing search infrastructure you cannot replace.

Failure Modes

Generated DocIDs may miss relevant documents if DocID design or training data lacks coverage.

If retrieved passages are noisy or missing, the model can still hallucinate despite reference decoding.

Core Entities

Models

CorpusLM, T5-Base, Llama2-7B

Metrics

R-Precision, Accuracy, Exact Match, ROUGE-L, F1, has_answer

Datasets

KILT, Wikipedia (2019-08-01 dump)

Benchmarks

KILT

Context Entities

Models

BM25, DPR, MT-DPR, RAG, E5, SimLM, SEAL, CorpusBrain, FiD, BART

Datasets

FEVER, AIDA CoNLL-YAGO (AY2), WNED-WIKI (WnWi), WNED-CWEB (WnCw), T-REx, Zero Shot RE (zsRE), Natural Questions (NQ), HotpotQA (HoPo), TriviaQA (TQA), ELI5, Wizard of Wikipedia (WoW)
