Practical survey of RAG: paradigms, core components, benchmarks, and engineering gaps

December 18, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

612

Authors

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang

Links

Abstract / PDF

Why It Matters For Business

RAG lets you keep LLMs current and auditable by fetching external facts at inference time; this reduces hallucinations and speeds updates without retraining the base model.

Summary TLDR

This paper surveys Retrieval-Augmented Generation (RAG) for large language models. It organizes RAG into three practical paradigms (Naive, Advanced, Modular), and breaks down the technical stack across retrieval, generation, and augmentation. The survey catalogs retrieval sources, indexing and query tricks, embedding and reranking methods, iterative/adaptive retrieval patterns, evaluation tasks/benchmarks, and engineering challenges (robustness, long-context tradeoffs, production tooling). The authors provide a compact evaluation map and a GitHub resource list.

Problem Statement

LLMs are powerful but make factual errors, go out of date, and hide their evidence trail. Research on retrieval augmentation is scattered. Practitioners need a unified view of RAG methods, components, evaluation practices, and production challenges.

Main Contribution

Systematic review of RAG research organized into Naive, Advanced, and Modular paradigms.

Detailed analysis of the three core RAG stages: Retrieval, Generation, and Augmentation.

Compilation of downstream tasks, ~50 datasets, benchmarks, and evaluation objectives, plus a discussion of open challenges and directions.

Key Findings

Surveyed RAG work covers a broad task and dataset space.

Numbers: 26 tasks; ~50 datasets

RAG often beats unsupervised fine-tuning on knowledge updates.

Including irrelevant documents can sometimes increase accuracy.

Numbers: accuracy +30% (reported example)

LLMs now accept very long contexts, changing RAG tradeoffs.

Numbers: >200,000 tokens

Who Should Care

What To Try In 7 Days

Build a simple RAG QA pipeline: chunk docs, create embeddings, run nearest-neighbor retrieval, and feed top-k snippets to an LLM.

Add a light reranker or LLM-based filter to improve context relevance before generation.

Measure hit rate/MRR and compare one-shot retrieval vs. a small iterative retrieval loop on a key task.
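
The first step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the survey's reference implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, retrieval is brute-force cosine similarity rather than a vector database, and the LLM call is left out (the sketch stops at building the prompt). All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word chunks (the simplest chunking policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts.
    Replace with a real embedding model in practice."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force nearest-neighbor retrieval: score every chunk, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, snippets: list[str]) -> str:
    """Number the retrieved snippets and place them ahead of the question."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical sample corpus, for illustration only.
docs = [
    "RAG retrieves external documents at inference time and passes them to the model.",
    "Fine-tuning updates model weights and needs a new training run for new facts.",
    "Vector databases store embeddings and support nearest-neighbor search.",
]
chunks = [c for d in docs for c in chunk_text(d)]
question = "How does RAG use external documents?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to whichever LLM you use. A reranker or LLM-based filter, as in the second step, would slot in between `retrieve` and `build_prompt`.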

Agent Features

Memory

  • Retrieval Memory (external KB)
  • LLM Self-memory modules

Planning

  • Iterative Retrieve-Generate loops
  • Recursive query decomposition

Tool Use

  • Search engines and vector DBs
  • LLM-generated queries (HyDE)

Frameworks

  • LlamaIndex
  • LangChain
  • Haystack

Architectures

  • Naive RAG
  • Advanced RAG
  • Modular RAG

Collaboration

  • Retriever-generator alignment training

Optimization Features

Token Efficiency

  • Context compression via small LM compressors
  • Sliding-window and Small2Big chunking
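
As a rough illustration of the chunking ideas listed above, the sketch below computes overlapping sliding-window spans over a token list and then widens a matched small span before it is handed to the generator. It assumes the common reading of Small2Big (retrieve on a small unit, return a larger surrounding context); the function names and the `pad` parameter are hypothetical, and real systems usually work on sentences or model tokens rather than whitespace-split words.

```python
def sliding_windows(n_tokens: int, window: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, end) spans of overlapping windows that cover n_tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last window reached the end of the text
            break
        start += stride
    return spans


def small_to_big(span: tuple[int, int], n_tokens: int, pad: int) -> tuple[int, int]:
    """Widen a small retrieved span by `pad` tokens on each side (clamped)."""
    start, end = span
    return max(0, start - pad), min(n_tokens, end + pad)


tokens = "a b c d e f g h i j".split()
spans = sliding_windows(len(tokens), window=4, stride=2)

# Suppose retrieval matched the third small window; give the generator more context.
big_start, big_end = small_to_big(spans[2], len(tokens), pad=3)
big_context = tokens[big_start:big_end]
```

Smaller windows tend to make the match more precise, while the widened span restores the surrounding context the generator needs. The right `window`, `stride`, and `pad` values depend on the corpus and are best tuned empirically.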

Infra Optimization

  • Vector DB indexing strategies
  • Hierarchical indices and KG-backed indexes

Model Optimization

  • Retriever fine-tuning
  • Adapter layers for retriever/generator alignment

System Optimization

  • Hybrid sparse+dense retrieval
  • Metadata routing to narrow search scope
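
One common way to combine a sparse ranking (for example BM25) with a dense ranking is reciprocal rank fusion (RRF). The summary above does not say which fusion method the surveyed systems use, so treat RRF here as an assumed, illustrative choice; the document IDs are made up and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a sparse and a dense retriever for the same query.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and embedding similarities live on different scales. Metadata routing would typically run before either retriever, narrowing the candidate set each one searches.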

Training Optimization

  • LM-supervised retriever (LSR)
  • Contrastive learning for compressors
  • KL alignment between retriever and generator

Inference Optimization

  • Reranking and filter-reranker patterns
  • Adaptive retrieval triggers (confidence thresholds)
  • Token elimination and prompt compression
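
An adaptive retrieval trigger can be as simple as checking the generator's own token probabilities and retrieving only when it looks unsure. The sketch below is a hedged illustration of that idea, similar in spirit to confidence-based methods such as FLARE but not a reproduction of any specific system. It assumes your LLM API exposes per-token log-probabilities; the function name and the `0.4` threshold are hypothetical.

```python
import math


def should_retrieve(token_logprobs: list[float], threshold: float = 0.4) -> bool:
    """Return True when the least confident generated token falls below the threshold.

    With no tokens to inspect there is no evidence of confidence, so this
    sketch defaults to retrieving (a design choice, not a requirement).
    """
    if not token_logprobs:
        return True
    min_prob = min(math.exp(lp) for lp in token_logprobs)
    return min_prob < threshold


# Hypothetical per-token log-probabilities from two draft generations.
confident_draft = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain_draft = [math.log(0.9), math.log(0.2), math.log(0.85)]

retrieve_for_confident = should_retrieve(confident_draft)
retrieve_for_uncertain = should_retrieve(uncertain_draft)
```

Using the minimum token probability makes the trigger sensitive to a single shaky span. An average over the draft would be more forgiving, and the threshold itself trades retrieval cost against the risk of answering without evidence.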

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Retrieval noise and irrelevant documents can still break generation quality.
  • Handling semi-structured data (tables, PDFs) is immature and error-prone.
  • Evaluation metrics for RAG aspects (faithfulness, integration) are not standardized.
  • Knowledge graphs give precision but incur build/maintenance cost.

When Not To Use

  • Ultra-low-latency or very high throughput systems where retrieval latency is unacceptable.
  • Tasks that require no external knowledge and can be solved by the base LLM.
  • Environments with strict data exposure rules where retrieved sources may leak private data.

Failure Modes

  • Hallucination despite retrieved context (generator ignores evidence).
  • Over-reliance on retrieved text leading to verbatim echoing.
  • Retriever misses critical documents (low recall) and yields wrong answers.
  • Data leakage or provenance errors exposing sources or metadata.

Core Entities

Models

  • RETRO++
  • InstructRETRO
  • REPLUG
  • Self-RAG
  • FLARE
  • RECITE
  • RAG-Robust

Metrics

  • Accuracy
  • EM
  • F1
  • Hit Rate
  • MRR
  • NDCG
  • BLEU
  • ROUGE-L
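
Of the metrics listed above, the retrieval-side ones (Hit Rate, MRR, NDCG) are straightforward to compute from ranked results and a set of known-relevant document IDs. The sketch below uses binary relevance and made-up data; the function names are illustrative, and established evaluation libraries may differ in edge-case handling.

```python
import math


def hit_rate(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)


def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant doc (0 when none is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg_at_k(ranked: list[str], rel: set[str], k: int) -> float:
    """NDCG@k with binary relevance for a single query."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranked results for three queries and their relevant doc IDs.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f"}, {"z"}]

hr_at_2 = hit_rate(results, relevant, k=2)   # only the first query hits in the top 2
hr_at_3 = hit_rate(results, relevant, k=3)   # first and second queries hit
mean_rr = mrr(results, relevant)             # (1 + 1/3 + 0) / 3
ndcg_q2 = ndcg_at_k(results[1], relevant[1], k=3)
```

These scores describe the retriever alone. The generation-side metrics in the list (EM, F1, BLEU, ROUGE-L) are computed on the final answers and need to be tracked separately.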

Datasets

  • NaturalQuestions
  • TriviaQA
  • SQuAD
  • HotpotQA
  • ELI5
  • arXiv/PubMed (PaperQA examples)
  • MS MARCO

Benchmarks

  • RGB
  • RECALL
  • CRUD
  • RAGAS
  • ARES
  • RALLE

Context Entities

Models

  • BERT-based retrievers
  • Dense retrievers (DPR-style)
  • Sparse retrievers (BM25)

Metrics

  • R-Rate (reappearance rate)
  • BERTScore

Datasets

  • HotpotQA
  • DPR Wikipedia
  • C4 (pretraining examples)

Benchmarks

  • MTEB (embedding leaderboard)
  • C-MTEB (Chinese)