Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.55
Citation Count
4
Why It Matters For Business
Retrieval augmentation makes LMs more factual and updatable by combining model memory with external, searchable knowledge, improving performance on knowledge-heavy tasks while enabling incremental updates without full model retraining.
Summary TLDR
This 30-page survey defines Retrieval-Augmented Language Models (RALMs), covering both Retrieval-Augmented Generation (RAG) and Retrieval-Augmented Understanding (RAU). It organizes how retrievers and language models interact (three interaction modes), classifies retrievers (sparse, dense, internet, hybrid), summarizes LM families used, reviews improvements (retrieval quality, timing, end-to-end training), catalogs applications (QA, dialogue, translation, summarization, code, vision/audio), and lists evaluation suites and common failure modes (robustness, retrieval quality, cost). The paper links to a GitHub resource list.
Problem Statement
There is no single, practical overview that covers both retrieval-augmented generation and retrieval-augmented understanding, their interaction patterns, retriever types, enhancements, evaluations, and open problems; this survey aims to fill that gap with a structured taxonomy and recommendations.
Main Contribution
Defines RALM and precisely classifies three retriever–LM interaction modes: sequential single, sequential multiple, and parallel.
Systematically reviews retriever types (sparse, dense, internet, hybrid) and common LM families used in RALM pipelines.
Summarizes enhancement strategies (retrieval quality control, retrieval timing, LM structural tuning, end-to-end training) and evaluation benchmarks.
Identifies core limitations (robustness, retrieval quality, cost, limited application diversity) and suggests practical future directions.
Key Findings
There are three high-level ways a retriever and LM interact: sequential single, sequential multiple (iterative), and parallel.
Retrievers fall into four practical categories: sparse (TF-IDF/BM25 and other sparse-vector methods), dense (dual-encoder), internet, and hybrid combinations.
Common failure modes are model distraction and lowered output quality when retrieved context is irrelevant or adversarial.
Evaluation has moved beyond generic benchmarks to RALM-specific suites covering faithfulness, context relevance, noise and counterfactual robustness (e.g., RAGAS, RGB, CRUD-RAG, RECALL, MIRAGE).
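The three interaction modes in the first finding can be sketched as control flow. This is a minimal illustration, not the survey's code: the function names, the `rounds` and `weight` parameters, and the stub retriever and generator are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]
Scorer = Callable[[str], Dict[str, float]]

def sequential_single(query: str, retrieve: Retriever, generate: Generator) -> str:
    """Retrieve once, then generate once from query plus documents."""
    docs = retrieve(query)
    return generate(query, docs)

def sequential_multiple(query: str, retrieve: Retriever, generate: Generator,
                        rounds: int = 2) -> str:
    """Alternate retrieval and generation; each round retrieves with the
    query plus everything generated so far."""
    output = ""
    for _ in range(rounds):
        docs = retrieve((query + " " + output).strip())
        output = (output + " " + generate(query, docs)).strip()
    return output

def parallel(query: str, lm_scores: Scorer, retriever_scores: Scorer,
             weight: float = 0.5) -> Dict[str, float]:
    """Run LM and retriever independently, then interpolate their scores."""
    a, b = lm_scores(query), retriever_scores(query)
    return {tok: (1 - weight) * a.get(tok, 0.0) + weight * b.get(tok, 0.0)
            for tok in set(a) | set(b)}

def make_stub():
    """Hypothetical stand-ins for a real retriever and LM; logs every query."""
    log: List[str] = []
    def retrieve(q: str) -> List[str]:
        log.append(q)
        return [f"doc{len(log)}"]
    def generate(q: str, docs: List[str]) -> str:
        return f"step-using-{docs[0]}"
    return log, retrieve, generate

log1, r1, g1 = make_stub()
single_out = sequential_single("who wrote it", r1, g1)

log2, r2, g2 = make_stub()
multi_out = sequential_multiple("who wrote it", r2, g2, rounds=2)

mixed = parallel("q", lambda q: {"a": 0.8, "b": 0.2},
                 lambda q: {"a": 0.2, "b": 0.8}, weight=0.5)
```

In the sequential modes the LM consumes retrieved text as context; in the parallel mode neither component sees the other's output until the final interpolation.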
Who Should Care
What To Try In 7 Days
Add a BM25 baseline to an existing LM pipeline and compare outputs on 10 knowledge queries.
Plug in a dense retriever (e.g., DPR) and compare its retrieval relevance against BM25 on your domain.
Implement a simple filter (lexical overlap or CXMI, conditional cross-mutual information) before prompt augmentation and check how the error rate changes.
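A dependency-free starting point for the BM25 baseline and the lexical-overlap filter might look like the sketch below. It assumes a tiny in-memory corpus, a regex tokenizer, a Lucene-style IDF, and a hypothetical `min_overlap` threshold of 0.5; none of these choices come from the survey, and a production setup would use an indexed search library instead.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25:
    """Minimal Okapi BM25 scorer over an in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Lucene-style IDF variant, which is never negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Indices of the k highest-scoring documents."""
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]

def lexical_overlap(query, passage):
    """Fraction of distinct query tokens that also occur in the passage."""
    q, p = set(tokenize(query)), set(tokenize(passage))
    return len(q & p) / len(q) if q else 0.0

def filter_passages(query, passages, min_overlap=0.5):
    """Drop passages whose overlap with the query is below the threshold."""
    return [p for p in passages if lexical_overlap(query, p) >= min_overlap]

corpus = [
    "BM25 is a sparse retrieval function based on term frequency.",
    "Dense retrievers such as DPR encode queries and passages with dual encoders.",
    "The weather in Paris is mild in spring.",
]
query = "What is BM25 retrieval?"
bm25 = BM25(corpus)
hits = [corpus[i] for i in bm25.top_k(query, k=2)]
kept = filter_passages(query, hits, min_overlap=0.5)  # passages to put in the prompt
```

Running the same 10 queries with and without `filter_passages` gives a quick read on whether filtering changes the error rate.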
Optimization Features
Token Efficiency
- Increase the number of retrieved passages rather than model size (FiD approach)
- Use reranking to focus top sources
Infra Optimization
- Streamline indexes and reduce embedding dimensionality for faster kNN lookups
Model Optimization
- Structural instruction tuning (FLAN-style)
- FiD and FiD-Light reader optimizations
- kNN-LM interpolation weight adaptation
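For the kNN-LM item, the core idea is to mix the LM's next-token distribution with a distribution built from retrieved nearest neighbors: p(y|x) = λ·p_kNN(y|x) + (1 − λ)·p_LM(y|x). The sketch below uses toy numbers. The `adaptive_lambda` rule is a hypothetical heuristic included only to show where an adapted weight would plug in; it is not taken from the survey or from any specific paper.

```python
import math

def knn_distribution(neighbors, vocab):
    """Build p_kNN from (distance, next_token) pairs.

    Weights are a softmax over negative distances, summed per token.
    Assumes `neighbors` is non-empty.
    """
    weights = [math.exp(-dist) for dist, _ in neighbors]
    total = sum(weights)
    probs = {tok: 0.0 for tok in vocab}
    for weight, (_, tok) in zip(weights, neighbors):
        probs[tok] += weight / total
    return probs

def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_LM."""
    return {tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p
            for tok, p in p_lm.items()}

def adaptive_lambda(nearest_distance, max_lam=0.5, scale=1.0):
    """HYPOTHETICAL adaptation rule: trust the datastore more when the
    closest neighbor is near, less when it is far away."""
    return max_lam * math.exp(-nearest_distance / scale)

vocab = ["paris", "london", "rome"]
p_lm = {"paris": 0.5, "london": 0.3, "rome": 0.2}            # toy LM output
neighbors = [(0.1, "paris"), (0.2, "paris"), (1.5, "rome")]  # toy datastore hits

p_knn = knn_distribution(neighbors, vocab)
lam = adaptive_lambda(min(dist for dist, _ in neighbors))
p_final = interpolate(p_lm, p_knn, lam)
```

With a fixed `lam` this reduces to plain kNN-LM interpolation; setting `lam` to 0 recovers the base LM unchanged.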
System Optimization
- Use internet search APIs for plug-and-play retrieval to avoid building full indices
- Intermediate modules to avoid modifying black-box LLMs
Training Optimization
- End-to-end retriever–reader training
- Knowledge distillation for retriever updates
- Instruction fine-tuning with retrieval
Inference Optimization
- Retrieval timing (when to call retriever)
- Prefix encoding to reduce runtime
- Gating circuits to block irrelevant retrieved docs
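Retrieval timing can be illustrated with a confidence-triggered gate: draft an answer first and call the retriever only when the draft contains a low-probability token. The sketch below is one possible policy under stated assumptions; the 0.4 threshold and the stub `draft_fn`, `retrieve_fn`, and `generate_fn` callables are hypothetical placeholders for a real LM and retriever.

```python
def should_retrieve(token_probs, threshold=0.4):
    """True when any drafted token falls below the confidence threshold."""
    return any(p < threshold for p in token_probs)

def answer(query, draft_fn, retrieve_fn, generate_fn, threshold=0.4):
    """Return the draft if it is confident; otherwise retrieve and regenerate."""
    draft, token_probs = draft_fn(query)
    if not should_retrieve(token_probs, threshold):
        return draft
    return generate_fn(query, retrieve_fn(query))

# --- hypothetical stubs standing in for a real LM and retriever ---
retrieval_calls = []

def draft_fn(query):
    """Pretend LM: confident on one query, unsure on everything else."""
    if "capital" in query:
        return "Paris", [0.9]
    return "Maybe 1987", [0.8, 0.2]

def retrieve_fn(query):
    retrieval_calls.append(query)  # track how often retrieval is paid for
    return ["stub passage"]

def generate_fn(query, passages):
    return f"grounded answer from {len(passages)} passage(s)"

confident = answer("capital of France?", draft_fn, retrieve_fn, generate_fn)
uncertain = answer("founding year of the club?", draft_fn, retrieve_fn, generate_fn)
```

Counting entries in `retrieval_calls` against total queries gives a rough measure of how many retriever calls the gate saves.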
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Poor robustness to adversarial or irrelevant retrieved context (prefix attacks and prompt perturbation).
- Retrieval quality is uneven, especially when using raw internet sources without strong filtering.
- High compute and storage costs for large indices, multi-document encoding, and end-to-end training.
- Limited diversity of mature real-world applications beyond QA, summarization, dialogue and translation.
- Evaluation blind spots: many datasets use LM-generated data or lack adversarial tests.
When Not To Use
- When strict low-latency or minimal inference compute is required.
- If your retrieval sources are untrusted or highly noisy without good filtering.
- For small, well-covered tasks where model parametric memory already suffices.
Failure Modes
- Irrelevant or misleading retrieved documents degrade generation quality.
- Prompt injection or prefix attacks alter retriever outputs or LM behavior.
- Overfitting to retrieval corpus when using LM-generated training corpora.
- Excessive cost from indexing and repeated retrieval calls.
Core Entities
Models
- RAG
- REALM
- FiD
- kNN-LM
- DPR
- Contriever
- ColBERT
- BART
- T5
- BERT
- GPT-3/3.5/4
- Llama/Llama2
- SELF-RAG
- Selfmem
- FILCO
Metrics
- ROUGE
- BLEU
- BERTScore
- Accuracy
- Faithfulness / Context Relevance
- Noise robustness
- Counterfactual robustness
Datasets
- Wikipedia / KILT
- HotpotQA
- Natural Questions (NQ)
- FEVER
- CNN/DailyMail
- XSum
- BigPatent
- IWSLT14 De-En
- StrategyQA
- MMLU-Med
Benchmarks
- KILT
- SuperGLUE
- RAGAS
- RGB
- CRUD-RAG
- ARES
- MIRAGE
- RECALL
Context Entities
Models
- RETOMATON
- FiD-Light
- ADAPTRET
- TRIME
- RE-IMAGEN
- RDM
- REPLUG
Metrics
- FID (image)
- BLEU/ROUGE (text)
- RAGQuestEval
Datasets
- KILT/Wizard of Wikipedia
- COCO
- CUB
- CodeXGLUE
- MedMCQA
- EventKG
Benchmarks
- EntityDrawBench
- AESLC
- AG News
- Gigaword

