Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

July 18, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.25

Cost Impact Score

0.65

Citation Count

13

Authors

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Links

Abstract / PDF

Why It Matters For Business

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Summary TLDR

This survey maps the RAG landscape: how to build retrievers (chunking, sparse vs dense embeddings, ANN indexes, vector datastores), three fusion families (query/text concatenation, logits-level fusion, latent/cross-attention fusion), and training options (with/without datastore updates). It lists common tools (FAISS, Milvus, LangChain, LlamaIndex), evaluation metrics and benchmarks, and practical trade-offs: query-based fusion is simple but produces long inputs and slow inference; logits fusion is cheap but integrates retrieval only shallowly; latent fusion is powerful but needs model changes and training. The paper includes tutorial code pointers and highlights future gaps: retrieval quality, efficiency, fusion choices, training sync, and multimodal RAG.

Problem Statement

Large language models store knowledge in parameters but still hallucinate, lag on new facts, and lack domain expertise. Updating model weights is costly. RAG augments LLMs with an external knowledge store to reduce hallucinations, enable up-to-date knowledge, and provide domain adapters without full retraining.

Main Contribution

Systematic walkthrough of RAG components: retriever building, indexing, datastore design, and querying.

Categorization and tutorial code for three fusion families: query-based, logits-based, and latent (cross-attention/weighted) fusion.

Survey of RAG training regimes: with and without datastore updates, and discussion of joint retriever-generator training.

Coverage of evaluation benchmarks, tasks, frameworks, applications (agents), and future research directions.
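The retriever steps named above (chunk, embed, search) and the simplest fusion family (query-based concatenation) can be sketched in a few lines. This is a minimal illustration rather than the survey's tutorial code: the bag-of-words `embed` is a toy stand-in for a real encoder, the brute-force search stands in for an ANN index, and all function names and the prompt template are assumptions made for this example.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping fixed-size word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks


def embed(text):
    """Toy sparse embedding: bag-of-words counts (assumption, not a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Brute-force top-k search; an ANN index would replace this at scale."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Query-based fusion: concatenate retrieved text with the question."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "FAISS is a library for similarity search over dense vectors.",
    "LoRA adapts large models by training low-rank update matrices.",
    "Milvus is a vector database that serves similarity search at scale.",
]
chunks = [c for d in docs for c in chunk_text(d)]
question = "Which library does similarity search?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the generator. Every retrieved chunk lengthens the input, which is the token-limit and latency cost of query-based fusion.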

Key Findings

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO uses a 2T-token DB and reaches performance comparable to GPT-3 with ~25x fewer parameters

Fusion method choice trades off simplicity, cost, and depth of integration.

Retrieval quality largely controls downstream RAG performance; embedding choice, key design, similarity metric and ANN index all matter.

Numbers: Precision@K / Recall@K recommended for retrieval; studies link imperfect retrieval to hallucination
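As a concrete reading of the two retrieval metrics, the sketch below computes Precision@K and Recall@K for one query. It assumes binary relevance labels and follows the common convention of dividing precision by K even when fewer than K results come back. The function names and document ids are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d9"}          # human-labelled relevant chunks

p_at_4 = precision_at_k(retrieved, relevant, 4)  # 2 hits / 4 = 0.5
r_at_4 = recall_at_k(retrieved, relevant, 4)     # 2 hits / 3 relevant
```

Averaging these per-query values over a labelled query set gives the numbers usually reported for a retriever.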

Results

model-size-equivalent performance

Value: RETRO with retrieval matched GPT-3-level results

Baseline: GPT-3

Who Should Care

What To Try In 7 Days

Prototype query-based fusion using LangChain + FAISS and an embedding model (e.g., text-embedding-ada-002).

Build a small domain datastore: chunk docs, embed chunks, measure Precision@K and sample human checks.

Try two fusion modes, text concatenation for fast results and logits fusion for lower latency and compute, then compare their outputs on target tasks (QA or summarization).
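For the logits-fusion experiment, one common formulation is the kNN-LM interpolation: the model's next-token distribution is mixed with a distribution built from retrieved neighbours, p = λ · p_kNN + (1 − λ) · p_LM. The sketch below assumes neighbours arrive as (next_token, distance) pairs and that `p_lm` is already a probability dictionary. The names, the default λ, and the temperature parameter are illustrative choices, not values from the survey.

```python
import math


def knn_distribution(neighbors, temperature=1.0):
    """Turn (next_token, distance) pairs into p_kNN via a softmax over -distance."""
    if not neighbors:
        return {}
    scores = [-dist / temperature for _, dist in neighbors]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    p_knn = {}
    for (token, _), weight in zip(neighbors, weights):
        # Neighbours that share a next token pool their probability mass.
        p_knn[token] = p_knn.get(token, 0.0) + weight / total
    return p_knn


def fuse_logits(p_lm, p_knn, lam=0.25):
    """Interpolate: lam * p_kNN + (1 - lam) * p_LM."""
    if not p_knn:  # nothing retrieved: fall back to the model alone
        return dict(p_lm)
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


p_lm = {"Paris": 0.5, "London": 0.5}
neighbors = [("Paris", 0.0), ("Paris", 0.0)]  # both neighbours vote for "Paris"
fused = fuse_logits(p_lm, knn_distribution(neighbors), lam=0.5)
```

Here the retrieved neighbours shift probability toward "Paris" without lengthening the model input, which is why this family avoids the long-context cost of concatenation.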

Agent Features

Memory

  • external retrieval memory (key-value datastore)
  • short-term context augmentation

Tool Use

  • web search
  • external datastore access
  • retrieval frameworks (LangChain)

Frameworks

  • LangChain
  • LlamaIndex

Architectures

  • decoder-only LLMs
  • encoder-decoder models with cross-attention
  • retrieval-enhanced transformers (RETRO-style)

Optimization Features

Token Efficiency

  • prompt templates and filtering retrieved contexts
  • importance weighting and retrieval filtering

Infra Optimization

  • use SSD-backed key-value stores (LMDB/RocksDB) and vector DBs (FAISS, Milvus)
  • cache hot retrievals to reduce I/O
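Caching hot retrievals can be as simple as an LRU cache in front of the datastore. The sketch below is a hypothetical wrapper, not the design of RAGCache or any other cited system: `search_fn` stands for whatever function queries the real index, and query normalization by lowercasing and whitespace collapsing is an assumption.

```python
from functools import lru_cache


class CachedRetriever:
    """Wrap a search function with an in-memory LRU cache keyed on (query, k)."""

    def __init__(self, search_fn, max_entries=1024):
        self.backend_calls = 0  # how often the backing store was actually hit

        def _search(query, k):
            self.backend_calls += 1
            return tuple(search_fn(query, k))  # tuple: cached value stays immutable

        self._cached = lru_cache(maxsize=max_entries)(_search)

    @staticmethod
    def _normalize(query):
        # Collapse case and whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def search(self, query, k=5):
        return list(self._cached(self._normalize(query), k))

    def invalidate(self):
        """Call after an index refresh so stale results are not served."""
        self._cached.cache_clear()


def slow_search(query, k):
    # Stand-in for a disk- or network-backed lookup (hypothetical).
    return [f"chunk-{i} for {query}" for i in range(k)]


retriever = CachedRetriever(slow_search, max_entries=128)
first = retriever.search("What is RAG?", k=2)
second = retriever.search("  what is   rag? ", k=2)  # served from the cache
```

Clearing the cache whenever the index is rebuilt keeps the cache from serving results that no longer match the datastore.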

Model Optimization

  • LoRA
  • distillation/TinyBERT for encoder speedups
  • quantization for encoder efficiency (cited works)

System Optimization

  • product quantization (PQ), PCA, LSH for index compression
  • hardware-aware ANN (HNSW, IVFPQ) and system co-design (PipeRAG, RAGCache)
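Of the compression techniques listed, LSH is the shortest to illustrate. The sketch below uses random-hyperplane (sign) hashing, one standard LSH family for cosine similarity: each vector is reduced to a short bit code, and Hamming distance between codes approximates angular distance. The function names, bit width, and seed are illustrative assumptions, and a production index would use a library implementation.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector lies on its positive side."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=3, n_bits=16, seed=42)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]     # same angle, different length
opposite = [-1.0, -2.0, -3.0]        # opposite direction

code_v = lsh_code(v, planes)
code_same = lsh_code(same_direction, planes)
code_opp = lsh_code(opposite, planes)
```

Vectors pointing in the same direction share a code and opposite vectors differ in every bit. The trade-off is that nearby but distinct vectors can collide or split, which is the loss of nearest-neighbor fidelity that index compression risks.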

Training Optimization

  • asynchronous index refresh to reduce retraining costs
  • joint retriever-generator training with preselected candidate sets (REALM, RAG, Atlas)

Inference Optimization

  • FiD-Light and ReFusion for cheaper fusion
  • logits fusion to avoid long context inputs

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey-level work: no new experiments or datasets provided
  • Performance depends heavily on retrieval quality and index choices
  • Latent fusion needs architecture changes and large pretraining budgets
  • Query-based fusion can hit token limits and increase latency

When Not To Use

  • When strict low latency (<100 ms) is required and you cannot cache retrievals
  • When no reliable, high-quality external datastore exists
  • When data privacy forbids external retrieval or storage

Failure Modes

  • Bad retrievals (irrelevant or contradictory) cause hallucinations
  • Stale datastore yields confidently wrong answers
  • ANN compression or PQ can drop nearest-neighbor fidelity and hurt accuracy
  • Index rebuilds and async updates can desynchronize retriever and generator

Core Entities

Models

  • RETRO
  • kNN-LM
  • REALM
  • RAG
  • Atlas
  • FiD
  • Enc-Dec
  • kNN-MT
  • ReFusion

Metrics

  • Precision@K
  • Recall@K
  • Accuracy
  • F1
  • Exact Match
  • Perplexity
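For the answer-level metrics, the sketch below shows Exact Match and token-level F1 as they are commonly computed for extractive QA. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention. That choice is an assumption here, not something the survey prescribes.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

Exact Match rewards only the fully correct answer, while token F1 gives partial credit for a longer answer that contains the gold span.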

Benchmarks

  • UDA
  • ARES
  • LegalBench-RAG
  • medical RAG benchmark
  • RAGBench

Context Entities

Models

  • BERT
  • RoBERTa
  • SimCSE
  • text-embedding-ada-002
  • sentence-transformer