Overview
The paper reports consistent improvements across 11 KILT datasets, along with lower storage and latency than index-based RAG; ablations identify which components matter. All evidence comes from experiments on Wikipedia and KILT, so expect extra integration work for non-Wikipedia or live corpora.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
CorpusLM can replace a heavy index+reader stack with a single model. On wiki-like corpora it reduces storage and latency while improving factual retrieval and downstream answers, which lowers hosting and inference costs for knowledge-driven products.
Summary TLDR
CorpusLM is a single autoregressive model that combines generative retrieval (generate document IDs), closed-book generation, and a continuous retrieval-augmented generation (RAG) flow that decodes DocIDs, extracts fine-grained references, then outputs the answer. Key training ideas are ranking-oriented DocID list generation, unsupervised DocID understanding tasks, and a continuous DocIDs→references→answer decoding with noise sampling. Evaluated on KILT (11 datasets), CorpusLM (T5 and Llama2 backbones) improves retrieval R-Precision and downstream scores versus dense, sparse, and prior generative retrievers, while cutting storage and latency versus index-based RAG.
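The DocIDs→references→answer flow described above can be sketched as three stages that share one growing context. This is only an illustration under stated assumptions: `continuous_rag`, `fake_generate`, the stage labels (`DocIDs:`, `References:`, `Answer:`), and the `;` separator are all hypothetical, and the three separate `generate` calls stand in for what the summary describes as a single decoding pass.

```python
from typing import Callable, List, NamedTuple


class RagOutput(NamedTuple):
    docids: List[str]
    references: List[str]
    answer: str


def split_items(text: str) -> List[str]:
    """Split a ';'-separated string (hypothetical format) into clean items."""
    return [part.strip() for part in text.split(";") if part.strip()]


def continuous_rag(query: str, generate: Callable[[str], str]) -> RagOutput:
    """Run three stages over one growing context.

    `generate` is a stand-in for a model call that continues the context.
    """
    # Stage 1: decode a ranked list of DocIDs.
    context = f"Query: {query}\nDocIDs:"
    docid_text = generate(context)

    # Stage 2: decode fine-grained references, conditioned on the DocIDs.
    context += f" {docid_text}\nReferences:"
    reference_text = generate(context)

    # Stage 3: decode the answer, conditioned on everything so far.
    context += f" {reference_text}\nAnswer:"
    answer = generate(context).strip()

    return RagOutput(split_items(docid_text), split_items(reference_text), answer)


def fake_generate(context: str) -> str:
    """Canned responses keyed on the last stage label; replaces a real model."""
    if context.endswith("DocIDs:"):
        return "Eiffel Tower; Paris"
    if context.endswith("References:"):
        return "The Eiffel Tower is a wrought-iron lattice tower in Paris."
    return "Paris"
```

Calling `continuous_rag("Where is the Eiffel Tower?", fake_generate)` returns the DocIDs, the reference text, and the answer as separate fields, which makes it easy to check whether the answer is supported by the decoded references.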
Problem Statement
Large LMs hallucinate on knowledge-intensive tasks. Traditional RAG uses a separate index and reader, costing memory and blocking end-to-end training. Generative retrieval (produce DocIDs) exists, but prior work rarely unifies retrieval and answer generation in one greedy decoding process or trains the model to rank DocIDs and use them as intermediate references.
Main Contribution
CorpusLM: a unified autoregressive model that performs generative retrieval, closed-book generation, and continuous RAG in one greedy decoding pass.
Ranking-oriented DocID list generation: train on ranked lists of DocIDs and use dynamic prefix-tree constraints to generate valid non-repetitive DocID lists.
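One way to picture the dynamic prefix-tree constraint is a trie over tokenized DocIDs that is pruned after each DocID is emitted, so every generated DocID is valid and none repeats. The sketch below is a toy under stated assumptions, not the paper's implementation: `DocIDTrie`, `decode_docid_list`, the `END` marker, and the `score` callable (a stand-in for model token scores) are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-DocID marker


class DocIDTrie:
    """Prefix tree over tokenized DocIDs that supports removing used DocIDs."""

    def __init__(self, docids: Iterable[Sequence[str]]):
        self.root: Dict[str, dict] = {}
        for tokens in docids:
            node = self.root
            for tok in list(tokens) + [END]:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep `prefix` on a path to some remaining DocID."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return sorted(node)

    def remove(self, tokens: Sequence[str]) -> None:
        """Delete one DocID and prune branches that no longer lead anywhere."""
        path = list(tokens) + [END]
        nodes = [self.root]
        for tok in path:
            child = nodes[-1].get(tok)
            if child is None:
                return  # DocID not present
            nodes.append(child)
        for depth in range(len(path) - 1, -1, -1):
            parent, tok = nodes[depth], path[depth]
            if parent[tok]:  # child still has other continuations
                break
            del parent[tok]


def decode_docid_list(
    trie: DocIDTrie, score: Callable[[Sequence[str], str], float], k: int
) -> List[Tuple[str, ...]]:
    """Greedily decode up to k distinct DocIDs, mutating the trie as it goes."""
    results: List[Tuple[str, ...]] = []
    for _ in range(k):
        prefix: List[str] = []
        while True:
            options = trie.allowed_next(prefix)
            if not options:
                return results  # no DocIDs left
            best = max(options, key=lambda tok: score(prefix, tok))
            if best == END:
                break
            prefix.append(best)
        results.append(tuple(prefix))
        trie.remove(prefix)  # the dynamic step: block repeats
    return results
```

With a real model, `score` would return the logit of each candidate token given the decoded prefix; the constraint itself only ever needs `allowed_next` and `remove`.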
Key Findings
CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).
Continuous RAG with CorpusLM yields higher downstream accuracy on KILT FEVER than DPR+BART.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Retrieval R-Precision (FEVER) | 75.64% | MT-DPR 64.05% | +11.59 pp | KILT dev (FEVER) | Table 2 shows CorpusLM (T5) 75.64 vs MT-DPR 64.05 | Table 2 |
| Downstream accuracy (FEVER) | 90.22% | DPR+BART 88.11% | +2.11 pp | KILT dev (FEVER) RAG setting | Table 3 shows CorpusLM (Llama2) 90.22 vs DPR+BART 88.11 | Table 3 |
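For readers who want to reproduce the retrieval metric on their own data, here is a minimal sketch of R-Precision in its standard form: with R relevant documents, it is the fraction of the top-R retrieved documents that are relevant. The function names are illustrative, and KILT's own evaluation scripts handle provenance sets in more detail than this toy does. The delta helper only restates the arithmetic in the table (percentage points, not relative change).

```python
from typing import Iterable, Sequence


def r_precision(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Fraction of the top-R ranked documents that are relevant, R = |relevant|."""
    relevant_set = set(relevant)
    r = len(relevant_set)
    if r == 0:
        return 0.0
    # Drop duplicate predictions while preserving rank order.
    deduped = list(dict.fromkeys(ranked))
    hits = sum(1 for doc in deduped[:r] if doc in relevant_set)
    return hits / r


def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)
```

When a query has a single gold page, R is 1 and the metric reduces to checking whether the top-ranked document is correct.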
What To Try In 7 Days
Run a small pilot: finetune CorpusLM on your FAQ/KB using its DocID list format and compare top-k retrieval quality to your current retriever.
Implement the continuous decode flow: generate DocIDs, short references, then answers to see if answers become more grounded.
Add DocID understanding tasks (generate summary from DocID) to your finetuning mix and measure retrieval ranking gains.
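As a starting point for the third item, the sketch below builds DocID-to-summary training pairs and mixes a capped number of them into an existing fine-tuning set. Everything here is an assumption for illustration: the prompt wording, the `docid` and `summary` field names, the `aux_ratio` parameter, and both function names are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, Iterable, List, Sequence

Example = Dict[str, str]


def docid_to_summary_examples(docs: Iterable[Dict[str, str]]) -> List[Example]:
    """Turn each document into a 'generate summary from DocID' training pair."""
    return [
        {
            "input": f"Summarize the document with DocID: {doc['docid']}",
            "target": doc["summary"],
        }
        for doc in docs
    ]


def mix_examples(
    main: Sequence[Example], aux: Sequence[Example], aux_ratio: float, seed: int = 0
) -> List[Example]:
    """Add up to aux_ratio * len(main) auxiliary examples, then shuffle."""
    rng = random.Random(seed)
    n_aux = min(len(aux), int(aux_ratio * len(main)))
    mixed = list(main) + rng.sample(list(aux), n_aux)
    rng.shuffle(mixed)
    return mixed
```

Training once with and once without the auxiliary pairs, at the same seed, gives a simple A/B comparison of retrieval ranking quality.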
Risks & Boundaries
Limitations
Evaluations are limited to KILT tasks and a Wikipedia corpus; cross-domain or live-web performance is untested.
Method relies on pre-defined DocIDs and a prefix-tree constraint; custom DocID design may be needed per corpus.
When Not To Use
When your knowledge source is highly dynamic (live web) and DocIDs cannot be pre-built.
When you require standard dense-index features (ANN search with external indexes) or depend on existing search infrastructure that you cannot replace.
Failure Modes
Generated DocIDs may miss relevant documents if DocID design or training data lacks coverage.
If retrieved passages are noisy or missing, the model can still hallucinate despite reference decoding.

