Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
CorpusLM can replace a heavy index-plus-reader stack with a single model. On wiki-like corpora it reduces storage and latency while improving retrieval quality and downstream answers, which lowers hosting and inference costs for knowledge-driven products.
Summary TLDR
CorpusLM is a single autoregressive model that combines generative retrieval (generate document IDs), closed-book generation, and a continuous retrieval-augmented generation (RAG) flow that decodes DocIDs, extracts fine-grained references, then outputs the answer. Key training ideas are ranking-oriented DocID list generation, unsupervised DocID understanding tasks, and a continuous DocIDs→references→answer decoding with noise sampling. Evaluated on KILT (11 datasets), CorpusLM (T5 and Llama2 backbones) improves retrieval R-Precision and downstream scores versus dense, sparse, and prior generative retrievers, while cutting storage and latency versus index-based RAG.
Problem Statement
Large LMs hallucinate on knowledge-intensive tasks. Traditional RAG uses a separate index and reader, costing memory and blocking end-to-end training. Generative retrieval (produce DocIDs) exists, but prior work rarely unifies retrieval and answer generation in one greedy decoding process or trains the model to rank DocIDs and use them as intermediate references.
Main Contribution
CorpusLM: a unified autoregressive model that performs generative retrieval, closed-book generation, and continuous RAG in one greedy decoding pass.
Ranking-oriented DocID list generation: train on ranked lists of DocIDs and use dynamic prefix-tree constraints to generate valid non-repetitive DocID lists.
Continuous DocIDs→References→Answer decoding plus unsupervised DocID-understanding tasks and noise sampling to improve end-to-end RAG quality and efficiency.
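The dynamic prefix-tree constraint can be illustrated with a minimal sketch. It assumes DocIDs are token sequences stored in a trie, that greedy decoding picks the best-scoring allowed token at each step, and that a DocID is removed from the trie once emitted so the list cannot repeat it. The `PrefixTree` class, the `END` marker, and the `score` callback (standing in for the model's next-token score) are hypothetical names for illustration, not the paper's code.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # hypothetical marker: a complete DocID terminates here


class PrefixTree:
    """Trie over tokenized DocIDs that supports removing an ID after it is emitted."""

    def __init__(self, docids: Sequence[Sequence[str]]) -> None:
        self.root: Dict = {}
        for tokens in docids:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[END] = {}

    def allowed(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep the prefix on a path to some remaining DocID."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return list(node.keys())

    def remove(self, tokens: Sequence[str]) -> None:
        """Delete one DocID and prune branches that no longer lead anywhere."""
        path = [self.root]
        for tok in tokens:
            path.append(path[-1][tok])
        path[-1].pop(END, None)
        for depth in range(len(tokens), 0, -1):
            if path[depth]:  # other DocIDs still share this branch
                break
            del path[depth - 1][tokens[depth - 1]]


def decode_docid_list(
    score: Callable[[List[List[str]], List[str], str], float],
    tree: PrefixTree,
    k: int,
) -> List[List[str]]:
    """Greedily decode up to k distinct, valid DocIDs under the trie constraint."""
    results: List[List[str]] = []
    for _ in range(k):
        if not tree.root:  # every DocID has already been emitted
            break
        prefix: List[str] = []
        while True:
            options = tree.allowed(prefix)
            best = max(options, key=lambda tok: score(results, prefix, tok))
            if best == END:
                break
            prefix.append(best)
        tree.remove(prefix)  # the "dynamic" part: this ID is no longer decodable
        results.append(prefix)
    return results


if __name__ == "__main__":
    # Toy scorer with fixed token preferences; a real system would use model logits.
    prefs = {"Paris": 3.0, "France": 2.0, "Texas": 1.0, "Rome": 0.5, END: 5.0}
    toy_tree = PrefixTree([["Paris", "France"], ["Paris", "Texas"], ["Rome"]])
    print(decode_docid_list(lambda done, pre, tok: prefs.get(tok, 0.0), toy_tree, 3))
```

Pruning emptied branches keeps every remaining path valid, so the decoder never reaches a dead end in the middle of a DocID.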
Key Findings
CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).
Continuous RAG with CorpusLM yields higher downstream accuracy on KILT FEVER than DPR+BART.
CorpusLM drastically reduces model storage and latency versus an index-based RAG setup.
DocID understanding auxiliary tasks materially help retrieval ranking.
Decoding references and noise sampling improve RAG performance.
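For readers unfamiliar with the retrieval metric named above, here is a minimal sketch of the plain R-Precision definition: with R relevant documents for a query, it is the fraction of the top-R retrieved documents that are relevant. The function name and the example IDs are illustrative, and the official KILT scorer may handle provenance details that this sketch ignores.

```python
from typing import Sequence, Set


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of the top-R ranked DocIDs that are relevant, where R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0  # convention chosen for this sketch: no gold documents scores zero
    top_r = set(ranked[:r])  # a set guards against double-counting repeated IDs
    return len(top_r & relevant) / r


if __name__ == "__main__":
    # Two gold documents, so only the top two ranked IDs are inspected.
    print(r_precision(["d1", "d7", "d3"], {"d1", "d3"}))
```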
Results
Retrieval R-Precision (FEVER)
Accuracy
Model size / storage / latency
Who Should Care
What To Try In 7 Days
Run a small pilot: finetune CorpusLM on your FAQ/KB using its DocID list format and compare top-k retrieval quality to your current retriever.
Implement the continuous decode flow: generate DocIDs, short references, then answers to see if answers become more grounded.
Add DocID understanding tasks (generate summary from DocID) to your finetuning mix and measure retrieval ranking gains.
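When piloting the DocIDs, references, answer flow, it helps to fix a serialization for the single decoded sequence so that each stage can be inspected separately. The sketch below is one possible layout, not the paper's format: the section tags, the `;` separator, and the `RagTrace` container are all assumptions made for illustration. It also includes a simple grounding check that flags references not found verbatim in the retrieved documents.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical section markers; the paper's actual special tokens are not given here.
DOC_TAG, REF_TAG, ANS_TAG = "[DOCIDS]", "[REFS]", "[ANSWER]"
SEP = " ; "  # assumes DocIDs and references never contain ';'


@dataclass
class RagTrace:
    docids: List[str]
    references: List[str]
    answer: str


def serialize(trace: RagTrace) -> str:
    """Build one target string in DocIDs -> references -> answer order."""
    return (
        f"{DOC_TAG} {SEP.join(trace.docids)} "
        f"{REF_TAG} {SEP.join(trace.references)} "
        f"{ANS_TAG} {trace.answer}"
    )


def _split_items(text: str) -> List[str]:
    return [item.strip() for item in text.split(";") if item.strip()]


def parse(text: str) -> RagTrace:
    """Split a decoded string back into its three stages."""
    head, answer = text.split(ANS_TAG, 1)
    doc_part, ref_part = head.split(REF_TAG, 1)
    doc_part = doc_part.replace(DOC_TAG, "", 1)
    return RagTrace(_split_items(doc_part), _split_items(ref_part), answer.strip())


def ungrounded_references(trace: RagTrace, corpus: Dict[str, str]) -> List[str]:
    """References that do not appear verbatim in any of the decoded documents."""
    texts = [corpus.get(docid, "") for docid in trace.docids]
    return [ref for ref in trace.references if not any(ref in t for t in texts)]


if __name__ == "__main__":
    trace = RagTrace(["Eiffel Tower"], ["The Eiffel Tower is in Paris."], "Paris")
    decoded = serialize(trace)
    print(decoded)
    print(parse(decoded) == trace)
```

Tracking the share of ungrounded references alongside answer accuracy gives a rough signal of whether the intermediate reference step is actually anchoring the answers.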
Agent Features
Memory
- DocID-based retrieval (generative)
Optimization Features
Token Efficiency
- decode short references before answers to reduce irrelevant context
Infra Optimization
- smaller model footprint avoids large dense index storage
Model Optimization
- LoRA
System Optimization
- DeepSpeed for Llama2 finetuning
Training Optimization
- multi-task finetuning across retrieval, closed-book, RAG and DocID tasks
- noise sampling during reference training
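The summary does not spell out how noise sampling is carried out, so the following is only one plausible reading: during reference training, irrelevant DocIDs are mixed into the gold DocID list so the model learns to extract references only from useful documents. The function name, the uniform sampling over the corpus, and the choice to preserve the gold ranking order are all assumptions of this sketch.

```python
import random
from typing import List, Sequence


def add_noise_docids(
    gold: Sequence[str],
    corpus_ids: Sequence[str],
    n_noise: int,
    rng: random.Random,
) -> List[str]:
    """Insert up to n_noise non-gold DocIDs at random positions, keeping gold order."""
    gold_set = set(gold)
    candidates = [d for d in corpus_ids if d not in gold_set]
    noise = rng.sample(candidates, min(n_noise, len(candidates)))
    mixed = list(gold)
    for docid in noise:
        # randint is inclusive on both ends, so noise may also land at the tail
        mixed.insert(rng.randint(0, len(mixed)), docid)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a pilot run is repeatable
    print(add_noise_docids(["g1", "g2"], ["g1", "g2", "n1", "n2", "n3"], 2, rng))
```

Keeping the gold DocIDs in their original relative order means the ranking signal in the training list survives the added noise.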
Inference Optimization
- dynamic constrained greedy decoding for DocIDs
- continuous single-pass decoding to avoid multi-round I/O
Reproducibility
Data Urls
- KILT benchmark (used for evaluation)
- English Wikipedia dump 2019-08-01 (used as corpus)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluations are limited to KILT tasks and a Wikipedia corpus; cross-domain or live-web performance is untested.
- Method relies on pre-defined DocIDs and a prefix-tree constraint; custom DocID design may be needed per corpus.
- The paper text does not link to released code; reproduction requires engineering effort (QLoRA/DeepSpeed setup).
When Not To Use
- When your knowledge source is highly dynamic (live web) and DocIDs cannot be pre-built.
- When you require standard dense-index features (ANN search with external indexes) or depend on existing search infrastructure you cannot replace.
- When you cannot modify training data to produce ranked DocID lists or DocID summaries.
Failure Modes
- Generated DocIDs may miss relevant documents if DocID design or training data lacks coverage.
- If retrieved passages are noisy or missing, the model can still hallucinate despite reference decoding.
- Prefix-tree constraints or decoding errors could produce invalid or truncated DocIDs in edge cases.
Core Entities
Models
- CorpusLM
- T5-Base
- Llama2-7B
Metrics
- R-Precision
- Accuracy
- Exact Match
- ROUGE-L
- F1
- has_answer
Datasets
- KILT
- Wikipedia (2019-08-01 dump)
Benchmarks
- KILT
Context Entities
Models
- BM25
- DPR
- MT-DPR
- RAG
- E5
- SimLM
- SEAL
- CorpusBrain
- FiD
- BART
Datasets
- FEVER
- AIDA CoNLL-YAGO (AY2)
- WNED-WIKI (WnWi)
- WNED-CWEB (WnCw)
- T-REx
- Zero Shot RE (zsRE)
- Natural Questions (NQ)
- HotpotQA (HoPo)
- TriviaQA (TQA)
- ELI5
- Wizard of Wikipedia (WoW)

