CorpusLM: unify generative retrieval and continuous RAG into one model

February 2, 2024 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

1

Authors

Xiaoxi Li, Zhicheng Dou, Yujia Zhou, Fangchao Liu

Why It Matters For Business

CorpusLM replaces a separate index-plus-reader stack with a single model. On Wikipedia-like corpora it cuts storage and latency while improving retrieval and downstream answer quality, which can lower hosting and inference costs for knowledge-driven products.

Summary TLDR

CorpusLM is a single autoregressive model that combines generative retrieval (generate document IDs), closed-book generation, and a continuous retrieval-augmented generation (RAG) flow that decodes DocIDs, extracts fine-grained references, then outputs the answer. Key training ideas are ranking-oriented DocID list generation, unsupervised DocID understanding tasks, and a continuous DocIDs→references→answer decoding with noise sampling. Evaluated on KILT (11 datasets), CorpusLM (T5 and Llama2 backbones) improves retrieval R-Precision and downstream scores versus dense, sparse, and prior generative retrievers, while cutting storage and latency versus index-based RAG.

Problem Statement

Large LMs hallucinate on knowledge-intensive tasks. Traditional RAG uses a separate index and reader, costing memory and blocking end-to-end training. Generative retrieval (produce DocIDs) exists, but prior work rarely unifies retrieval and answer generation in one greedy decoding process or trains the model to rank DocIDs and use them as intermediate references.

Main Contribution

CorpusLM: a unified autoregressive model that performs generative retrieval, closed-book generation, and continuous RAG in one greedy decoding pass.

Ranking-oriented DocID list generation: train on ranked lists of DocIDs and use dynamic prefix-tree constraints to generate valid non-repetitive DocID lists.

Continuous DocIDs→References→Answer decoding plus unsupervised DocID-understanding tasks and noise sampling to improve end-to-end RAG quality and efficiency.
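The prefix-tree constraint on DocID list generation can be illustrated with a small sketch. It assumes DocIDs are short token tuples and uses a hypothetical `score_fn` in place of the language model's next-token scores; the `END` marker and all function names are illustrative, not taken from the paper. The "dynamic" part is approximated here by removing each generated DocID from the trie so it cannot be produced twice.

```python
END = "<end_docid>"  # hypothetical end-of-DocID marker (assumption)


class DocIDTrie:
    """Prefix tree over tokenized DocIDs."""

    def __init__(self, docids):
        self.root = {}
        for docid in docids:
            self.add(docid)

    def add(self, docid):
        node = self.root
        for tok in list(docid) + [END]:
            node = node.setdefault(tok, {})

    def remove(self, docid):
        """Delete one DocID and prune branches that become empty."""
        toks = list(docid) + [END]
        path = [self.root]
        for tok in toks:
            path.append(path[-1][tok])
        for depth in range(len(toks) - 1, -1, -1):
            parent, tok = path[depth], toks[depth]
            if parent[tok]:
                break  # other DocIDs still share this prefix
            del parent[tok]

    def allowed(self, prefix):
        """Tokens that keep the prefix inside the set of valid DocIDs."""
        node = self.root
        for tok in prefix:
            node = node[tok]
        return list(node.keys())


def decode_docid_list(trie, score_fn, k):
    """Greedily decode up to k distinct, valid DocIDs (mutates the trie)."""
    results = []
    for _ in range(k):
        if not trie.root:
            break  # every DocID has already been generated
        prefix = []
        while True:
            options = trie.allowed(prefix)
            tok = max(options, key=lambda t: score_fn(results, prefix, t))
            if tok == END:
                break
            prefix.append(tok)
        results.append(tuple(prefix))
        trie.remove(prefix)  # dynamic constraint: no repeated DocIDs
    return results


# Toy usage: the scorer simply prefers tokens that appear in the query.
corpus = [("paris",), ("paris", "texas"), ("france",), ("berlin",)]
query_tokens = {"paris", "france"}


def toy_score(done, prefix, tok):
    return 1.0 if tok in query_tokens else 0.0


ranked = decode_docid_list(DocIDTrie(corpus), toy_score, k=3)
print(ranked)
```

With a real model, `score_fn` would be replaced by logits masked to the tokens returned by `allowed`, so that every decoding step stays inside the set of valid, not-yet-generated DocIDs.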

Key Findings

CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).

Numbers: FEVER R-Precision: CorpusLM (T5) 75.64 vs MT-DPR 64.05 (Δ +11.59 pp)

Continuous RAG with CorpusLM yields higher downstream accuracy on KILT FEVER than DPR+BART.

Numbers: RAG accuracy (FEVER): CorpusLM (Llama2) 90.22 vs DPR+BART 88.11 (Δ +2.11 pp)

CorpusLM drastically reduces model storage and latency versus an index-based RAG setup.

Numbers: Params / storage / latency: CorpusLM (T5) 220M / 426.1 MB / 78.4 ms vs RAG 626M / 59.3 GB / 106.7 ms

DocID understanding auxiliary tasks materially help retrieval ranking.

Numbers: Ablation: removing the DocID understanding tasks lowers T5 retrieval R-Precision on ELI5 by 18.42 pp

Decoding references and noise sampling improve RAG performance.

Numbers: Ablation: removing reference decoding lowers the NQ score by 2.92 pp and the HotpotQA (HoPo) score by 4.54 pp (T5)
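For readers unfamiliar with the retrieval metric quoted in these findings, R-Precision can be sketched as below. This is a minimal version assuming one flat set of gold DocIDs per query; KILT's official scorer works on page-level provenance sets and may differ in detail.

```python
def r_precision(ranked_docids, gold_docids):
    """Fraction of the top-R retrieved DocIDs that are relevant,
    where R is the number of gold DocIDs for the query."""
    gold = set(gold_docids)
    if not gold:
        return 0.0
    top_r = ranked_docids[: len(gold)]
    return sum(1 for d in top_r if d in gold) / len(gold)


# Two gold pages, so only the top two retrieved DocIDs are inspected.
score = r_precision(["A", "B", "C"], ["A", "C"])
print(score)
```

Because the cutoff equals the number of gold documents, a query with a single gold page reduces to checking whether the top-ranked DocID is correct.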

Results

Retrieval R-Precision (FEVER)

Value: 75.64%

Baseline: MT-DPR 64.05%

RAG Accuracy (FEVER)

Value: 90.22% (CorpusLM, Llama2)

Baseline: DPR+BART 88.11%

Model size / storage / latency

Value: 220M params / 426.1 MB / 78.4 ms

Baseline: RAG: 626M params / 59.3 GB / 106.7 ms
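The relative savings implied by the efficiency row can be worked out with a few lines of arithmetic. The sketch below uses only the figures reported above and assumes 1 GB = 1024 MB; the variable names are illustrative.

```python
# Reported figures (T5-based CorpusLM vs. index-based RAG baseline).
corpuslm = {"params_m": 220, "storage_mb": 426.1, "latency_ms": 78.4}
rag = {"params_m": 626, "storage_mb": 59.3 * 1024, "latency_ms": 106.7}  # assumes 1 GB = 1024 MB

storage_ratio = rag["storage_mb"] / corpuslm["storage_mb"]
param_ratio = rag["params_m"] / corpuslm["params_m"]
latency_cut = 1 - corpuslm["latency_ms"] / rag["latency_ms"]

print(f"storage: {storage_ratio:.0f}x smaller")
print(f"parameters: {param_ratio:.2f}x fewer")
print(f"latency: {latency_cut:.1%} lower")
```

The storage gap is by far the largest of the three; the parameter and latency gaps are much smaller.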

Who Should Care

What To Try In 7 Days

Run a small pilot: finetune CorpusLM on your FAQ/KB using its DocID list format and compare top-k retrieval quality to your current retriever.

Implement the continuous decode flow: generate DocIDs, short references, then answers to see if answers become more grounded.

Add DocID understanding tasks (generate summary from DocID) to your finetuning mix and measure retrieval ranking gains.
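A pilot along these lines needs training examples in three shapes: ranked DocID lists, DocID understanding pairs, and continuous DocIDs→references→answer targets. The sketch below shows one hypothetical way to format them; the task prefixes, separators, and field names are assumptions for illustration, not the format used in the paper.

```python
def ranking_example(query, ranked_docids, sep=" ; "):
    """Retrieval task: the target is the DocID list in relevance order."""
    return {"input": f"retrieve: {query}", "target": sep.join(ranked_docids)}


def docid_understanding_example(docid, summary):
    """Auxiliary task: generate a document summary from its DocID."""
    return {"input": f"summarize docid: {docid}", "target": summary}


def rag_example(query, ranked_docids, references, answer, sep=" ; "):
    """Continuous RAG task: DocIDs, then short references, then the answer."""
    target = (
        f"docids: {sep.join(ranked_docids)}\n"
        f"references: {' | '.join(references)}\n"
        f"answer: {answer}"
    )
    return {"input": f"answer with references: {query}", "target": target}


examples = [
    ranking_example("capital of France", ["Paris", "France"]),
    docid_understanding_example("Paris", "Paris is the capital city of France."),
    rag_example(
        "capital of France",
        ["Paris", "France"],
        ["Paris is the capital of France."],
        "Paris",
    ),
]
for ex in examples:
    print(ex)
```

Mixing all three example types in one finetuning set mirrors the multi-task setup described in this summary; the mixing ratio is left open here because the summary does not state one.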

Agent Features

Memory

  • DocID-based retrieval (generative)

Optimization Features

Token Efficiency

  • decode short references before answers to reduce irrelevant context

Infra Optimization

  • smaller model footprint avoids large dense index storage

Model Optimization

  • LoRA

System Optimization

  • DeepSpeed for Llama2 finetuning

Training Optimization

  • multi-task finetuning across retrieval, closed-book, RAG and DocID tasks
  • noise sampling during reference training
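This summary does not spell out how noise sampling is done, so the following is only one plausible reading: mix a few irrelevant DocIDs into the gold list during reference training, so the model learns not to extract references from them. The function name, the `(docid, is_gold)` pair layout, and the sampling scheme are all assumptions.

```python
import random


def add_noise_docids(gold_docids, corpus_docids, n_noise, rng):
    """Return shuffled (docid, is_gold) pairs with n_noise sampled negatives."""
    gold_set = set(gold_docids)
    negatives = [d for d in corpus_docids if d not in gold_set]
    noise = rng.sample(negatives, min(n_noise, len(negatives)))
    mixed = [(d, True) for d in gold_docids] + [(d, False) for d in noise]
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixed = add_noise_docids(
    gold_docids=["Paris", "France"],
    corpus_docids=["Paris", "France", "Berlin", "Texas", "Rome"],
    n_noise=2,
    rng=rng,
)
print(mixed)
```

In a training target, the pairs flagged `False` would be paired with an empty or "no relevant content" reference, again as an assumption about the setup.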

Inference Optimization

  • dynamic constrained greedy decoding for DocIDs
  • continuous single-pass decoding to avoid multi-round I/O
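Because the continuous flow emits DocIDs, references, and the answer in one generated sequence, downstream code has to split that sequence back into parts. The sketch below assumes a hypothetical tagged layout (`<docids>`, `<refs>`, `<answer>`) and hypothetical separators; the actual output format of CorpusLM is not given in this summary.

```python
import re

# Assumed layout: <docids>a ; b</docids> <refs>r1 | r2</refs> <answer>text</answer>
_PATTERN = re.compile(
    r"<docids>(.*?)</docids>\s*<refs>(.*?)</refs>\s*<answer>(.*?)</answer>",
    flags=re.S,
)


def parse_continuous_output(text):
    """Split one decoded sequence into DocIDs, references, and the answer."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("output does not follow the assumed layout")
    docids = [d.strip() for d in match.group(1).split(";") if d.strip()]
    refs = [r.strip() for r in match.group(2).split("|") if r.strip()]
    return {"docids": docids, "references": refs, "answer": match.group(3).strip()}


sample = (
    "<docids>Paris ; France</docids> "
    "<refs>Paris is the capital of France.</refs> "
    "<answer>Paris</answer>"
)
parsed = parse_continuous_output(sample)
print(parsed)
```

Raising on a malformed sequence makes truncated or invalid generations visible early, which is useful given the decoding failure modes listed later in this summary.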

Reproducibility

Data Urls

  • KILT benchmark (used for evaluation)
  • English Wikipedia dump 2019-08-01 (used as corpus)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluations are limited to KILT tasks and a Wikipedia corpus; cross-domain or live-web performance is untested.
  • Method relies on pre-defined DocIDs and a prefix-tree constraint; custom DocID design may be needed per corpus.
  • The paper text does not point to released code; reproduction needs engineering effort (QLoRA/DeepSpeed setup).

When Not To Use

  • When your knowledge source is highly dynamic (live web) and DocIDs cannot be pre-built.
  • When you require standard dense-index features (ANN search with external indexes) or depend on existing search infrastructure you cannot replace.
  • When you cannot modify training data to produce ranked DocID lists or DocID summaries.

Failure Modes

  • Generated DocIDs may miss relevant documents if DocID design or training data lacks coverage.
  • If retrieved passages are noisy or missing, the model can still hallucinate despite reference decoding.
  • Prefix-tree constraints or decoding errors could produce invalid or truncated DocIDs in edge cases.

Core Entities

Models

  • CorpusLM
  • T5-Base
  • Llama2-7B

Metrics

  • R-Precision
  • Accuracy
  • Exact Match
  • ROUGE-L
  • F1
  • has_answer

Datasets

  • KILT
  • Wikipedia (2019-08-01 dump)

Benchmarks

  • KILT

Context Entities

Models

  • BM25
  • DPR
  • MT-DPR
  • RAG
  • E5
  • SimLM
  • SEAL
  • CorpusBrain
  • FiD
  • BART

Datasets

  • FEVER
  • AIDA CoNLL-YAGO (AY2)
  • WNED-WIKI (WnWi)
  • WNED-CWEB (WnCw)
  • T-REx
  • Zero Shot RE (zsRE)
  • Natural Questions (NQ)
  • HotpotQA (HoPo)
  • TriviaQA (TQA)
  • ELI5
  • Wizard of Wikipedia (WoW)