Survey of retrieval-augmented language models: architectures, retrievers, enhancements, and benchmarks

April 30, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.55

Citation Count

4

Authors

Yucheng Hu, Yuxing Lu

Links

Abstract / PDF

Why It Matters For Business

Retrieval augmentation makes LMs more factual and updatable by combining model memory with external, searchable knowledge, improving performance on knowledge-heavy tasks while enabling incremental updates without full model retraining.

Summary TLDR

This 30-page survey defines Retrieval-Augmented Language Models (RALMs), covering both Retrieval-Augmented Generation (RAG) and Retrieval-Augmented Understanding (RAU). It organizes how retrievers and language models interact (three interaction modes), classifies retrievers (sparse, dense, internet, hybrid), summarizes LM families used, reviews improvements (retrieval quality, timing, end-to-end training), catalogs applications (QA, dialogue, translation, summarization, code, vision/audio), and lists evaluation suites and common failure modes (robustness, retrieval quality, cost). The paper links to a GitHub resource list.

Problem Statement

There is no single, practical overview that covers both retrieval-augmented generation and retrieval-augmented understanding, their interaction patterns, retriever types, enhancements, evaluations, and open problems; this survey aims to fill that gap with a structured taxonomy and recommendations.

Main Contribution

Defines RALM and precisely classifies three retriever–LM interaction modes: sequential single, sequential multiple, and parallel.

Systematically reviews retriever types (sparse, dense, internet, hybrid) and common LM families used in RALM pipelines.

Summarizes enhancement strategies (retrieval quality control, retrieval timing, LM structural tuning, end-to-end training) and evaluation benchmarks.

Identifies core limitations (robustness, retrieval quality, cost, limited application diversity) and suggests practical future directions.

Key Findings

There are three high-level ways a retriever and LM interact: sequential single, sequential multiple (iterative), and parallel.

Numbers: 3 interaction modes (Section 2)
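The three interaction modes can be illustrated with a minimal sketch. It is not the survey's code: the function names, the `toy_retrieve` and `toy_generate` stand-ins, the tiny corpus, and the fixed `rounds` count are all assumptions made for illustration. A real system would plug in an actual retriever and language model.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]
Generator = Callable[[str], str]


def sequential_single(query: str, retrieve: Retriever, generate: Generator) -> str:
    """Retrieve once, then generate from the retrieved documents plus the query."""
    docs = retrieve(query)
    return generate("\n".join(docs + [query]))


def sequential_multiple(query: str, retrieve: Retriever, generate: Generator,
                        rounds: int = 2) -> str:
    """Alternate retrieval and generation; each round sees the previous output."""
    output = ""
    for _ in range(rounds):
        docs = retrieve(query + " " + output)
        output = generate("\n".join(docs + [query, output]))
    return output


def parallel(query: str, retrieve: Retriever, generate: Generator,
             combine: Callable[[List[str], str], Tuple]) -> Tuple:
    """Run retriever and LM independently on the input, then combine the results."""
    return combine(retrieve(query), generate(query))


# Hypothetical stand-ins so the sketch runs without any model or index.
def toy_retrieve(text: str) -> List[str]:
    corpus = ["paris is the capital of france", "bm25 is a sparse retriever"]
    words = set(text.lower().split())
    return [doc for doc in corpus if words & set(doc.split())]


def toy_generate(prompt: str) -> str:
    return f"answer({len(prompt.split())} tokens of context)"


if __name__ == "__main__":
    q = "capital of france"
    print(sequential_single(q, toy_retrieve, toy_generate))
    print(sequential_multiple(q, toy_retrieve, toy_generate))
    print(parallel(q, toy_retrieve, toy_generate, lambda docs, gen: (docs, gen)))
```

In the parallel mode the `combine` step is where a weighted interpolation of the two outputs would normally go; here it simply returns both.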

Retrievers fall into four practical categories: sparse (word-frequency methods such as TF-IDF/BM25, and sparse vector representations), dense (dual-encoder), internet, and hybrid combinations.

Numbers: 4 retrieval categories (Section 3)
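As a concrete example of the sparse category, the following is a minimal BM25 scorer using only the standard library. It is a sketch, not a reference implementation: the whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the "+1 inside the logarithm" IDF variant are assumptions, and the function expects a non-empty document list.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting documents by these scores and passing the top few to the LM gives a simple sparse baseline against which dense or hybrid retrievers can be compared.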

Common failure modes are model distraction and lowered output quality when retrieved context is irrelevant or adversarial.

Evaluation has moved beyond generic benchmarks to RALM-specific suites covering faithfulness, context relevance, noise and counterfactual robustness (e.g., RAGAS, RGB, CRUD-RAG, RECALL, MIRAGE).

Who Should Care

What To Try In 7 Days

Add a BM25 baseline to an existing LM pipeline and compare outputs on 10 knowledge queries.

Plug in a dense retriever (DPR) and measure retrieval relevance against BM25 for your domain.

Implement a simple filter (lexical overlap or CXMI) before prompt augmentation and check whether the error rate changes.
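The lexical-overlap filter in the last step could look like the sketch below. The overlap definition (fraction of query tokens that appear in the passage), the whitespace tokenizer, and the `threshold=0.5` default are assumptions to be tuned per domain. CXMI is not sketched here because it requires scoring with a language model.

```python
from typing import List


def lexical_overlap(query: str, passage: str) -> float:
    """Fraction of distinct query tokens that also occur in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def filter_passages(query: str, passages: List[str], threshold: float = 0.5) -> List[str]:
    """Keep only passages whose overlap with the query meets the threshold."""
    return [p for p in passages if lexical_overlap(query, p) >= threshold]
```

Running the same set of knowledge queries with and without `filter_passages` applied to the retrieved passages gives the before/after error comparison.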

Optimization Features

Token Efficiency

  • Increase retrieved paragraphs instead of model size (FiD approach)
  • Use reranking to focus top sources

Infra Optimization

  • Streamline indexes and reduce embedding dimensionality for faster kNN lookups

Model Optimization

  • Structural instruction tuning (FLAN-style)
  • FiD and FiD-Light reader optimizations
  • kNN-LM interpolation weight adaptation
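For the kNN-LM item, the interpolation mixes the LM's next-token distribution with one built from retrieved nearest neighbours: p(y|x) = λ · p_kNN(y|x) + (1 − λ) · p_LM(y|x). The sketch below uses a fixed λ and dictionary-based distributions, both assumptions for illustration. Weight adaptation would choose λ per context rather than globally, and how it does so is not shown here.

```python
from typing import Dict


def interpolate(p_lm: Dict[str, float], p_knn: Dict[str, float], lam: float) -> Dict[str, float]:
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_lm) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_lm.get(token, 0.0)
        for token in vocab
    }
```

If both inputs are valid probability distributions, the result also sums to 1 for any λ in [0, 1].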

System Optimization

  • Use internet search APIs for plug-and-play retrieval to avoid building full indices
  • Intermediate modules to avoid modifying black-box LLMs

Training Optimization

  • End-to-end retriever–reader training
  • Knowledge distillation for retriever updates
  • Instruction / command fine-tuning with retrieval

Inference Optimization

  • Retrieval timing (when to call retriever)
  • Prefix encoding to reduce runtime
  • Gating circuits to block irrelevant retrieved docs
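One simple way to decide retrieval timing is to call the retriever only when the LM looks uncertain. The sketch below is an assumed heuristic, not a method taken from the survey: it triggers retrieval when any token probability in the draft falls below a hypothetical `threshold`.

```python
from typing import List


def should_retrieve(token_probs: List[float], threshold: float = 0.4) -> bool:
    """Trigger retrieval if the least confident token falls below the threshold."""
    # An empty draft gives no evidence of uncertainty, so no retrieval is triggered.
    return min(token_probs, default=1.0) < threshold
```

Skipping retrieval for confident drafts trades some factual coverage for fewer retriever calls, so the threshold is best tuned against both latency and error rate.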

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Poor robustness to adversarial or irrelevant retrieved context (prefix attacks and prompt perturbation).
  • Retrieval quality is uneven, especially when using raw internet sources without strong filtering.
  • High compute and storage costs for large indices, multi-document encoding, and end-to-end training.
  • Limited diversity of mature real-world applications beyond QA, summarization, dialogue and translation.
  • Evaluation blind spots: many datasets use LM-generated data or lack adversarial tests.

When Not To Use

  • When strict low-latency or minimal inference compute is required.
  • If your retrieval sources are untrusted or highly noisy without good filtering.
  • For small, well-covered tasks where model parametric memory already suffices.

Failure Modes

  • Irrelevant or misleading retrieved documents degrade generation quality.
  • Prompt injection or prefix attacks alter retriever outputs or LM behavior.
  • Overfitting to retrieval corpus when using LM-generated training corpora.
  • Excessive cost from indexing and repeated retrieval calls.

Core Entities

Models

  • RAG
  • REALM
  • FiD
  • kNN-LM
  • DPR
  • Contriever
  • ColBERT
  • BART
  • T5
  • BERT
  • GPT-3/3.5/4
  • Llama/Llama2
  • SELF-RAG
  • Selfmem
  • FILCO

Metrics

  • ROUGE
  • BLEU
  • BERTScore
  • Accuracy
  • Faithfulness / Context Relevance
  • Noise robustness
  • Counterfactual robustness

Datasets

  • Wikipedia / KILT
  • HotpotQA
  • Natural Questions (NQ)
  • FEVER
  • CNN/DailyMail
  • XSum
  • BigPatent
  • IWSLT14 De-En
  • StrategyQA
  • MMLU-Med

Benchmarks

  • KILT
  • SuperGLUE
  • RAGAS
  • RGB
  • CRUD-RAG
  • ARES
  • MIRAGE
  • RECALL

Context Entities

Models

  • RETOMATON
  • FiD-Light
  • ADAPTRET
  • TRIME
  • RE-IMAGEN
  • RDM
  • REPLUG

Metrics

  • FID (image)
  • BLEU/ROUGE (text)
  • RAGQuestEval

Datasets

  • KILT/Wizard of Wikipedia
  • COCO
  • CUB
  • CodeXGLUE
  • MedMCQA
  • EventKG

Benchmarks

  • EntityDrawBench
  • AESLC
  • AG News
  • Gigaword