Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

Overview

Decision Snapshot: Needs Validation

This is a comprehensive survey synthesizing many empirical works. Its practical guidance is strong, but it does not propose a new model, so it scores high for usefulness and moderate for technical novelty.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/1

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 80%

Novelty: 25%

Authors

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Why It Matters For Business

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO

Summary TLDR

This survey maps the RAG landscape: how to build retrievers (chunking, sparse vs dense embeddings, ANN indexes, vector datastores), three fusion families (query/text concatenation, logits-level fusion, latent/cross-attention fusion), and training options (with/without datastore updates). It lists common tools (FAISS, Milvus, LangChain, LlamaIndex), evaluation metrics/benchmarks, and practical trade-offs: query-based fusion is simple but long and slow; logits fusion is cheap but shallow; latent fusion is powerful but needs model changes and training. The paper includes tutorial code pointers and highlights future gaps: retrieval quality, efficiency, fusion choices, training sync, and multimod

Problem Statement

Large language models store knowledge in parameters but still hallucinate, lag on new facts, and lack domain expertise. Updating model weights is costly. RAG augments LLMs with an external knowledge store to reduce hallucinations, enable up-to-date knowledge, and provide domain adapters without full retraining.

Main Contribution

Systematic walkthrough of RAG components: retriever building, indexing, datastore design, and querying.

Categorization and tutorial code for three fusion families: query-based, logits-based, and latent (cross-attention/weighted) fusion.
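As a rough illustration of the query-based family, the sketch below concatenates retrieved chunks with the user question into a single prompt that any black-box LLM API could consume. The function name `build_prompt`, the template wording, and the character budget are assumptions made for this example and are not taken from the survey's tutorial code.

```python
from typing import List


def build_prompt(question: str, chunks: List[str], max_chars: int = 2000) -> str:
    """Concatenate retrieved chunks and the question into one prompt.

    Chunks are assumed to arrive ordered best-first. The loop stops adding
    chunks once the next one would exceed max_chars (a hypothetical budget
    standing in for the model's context limit).
    """
    kept: List[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # lower-ranked chunks are dropped
        kept.append(entry)
        used += len(entry)
    context = "\n".join(kept)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The budget check is what makes the "simple but long and slow" trade-off visible: every retained chunk lengthens the input the generator must process.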

Key Findings

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO (2T-token DB) reaches performance comparable to GPT-3 with ~25x fewer parameters

Practical Use: For new systems, consider a retrieval-enhanced architecture (cross-attention) to get comparable factual performance with a smaller model; expect extra engineering and pretraining cost.

Evidence Ref: Section 4.3 (RETRO discussion)

Fusion method choice trades off simplicity, cost, and depth of integration.

Practical Use: Start with query-based fusion for fast prototyping (works with black-box APIs). Move to logits fusion when the inference budget is tight. Adopt latent fusion only if you can modify and fine-tune models.

Evidence Ref: Section 4.4 (comparison of fusion types)
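To make the logits-level option more concrete, here is a minimal sketch in the style of kNN-LM, which interpolates the language model's next-token distribution with a distribution built from retrieved neighbors. It operates on probabilities rather than raw logits, and the function names, the toy token dictionaries, and the default interpolation weight `lam` are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbors: List[Tuple[str, float]]) -> Dict[str, float]:
    """Convert (next_token, distance) neighbors into a distribution.

    Each neighbor contributes exp(-distance) to its token; weights are
    then normalized. Returns an empty dict when there are no neighbors.
    """
    if not neighbors:
        return {}
    weights: Dict[str, float] = {}
    for token, dist in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-dist)
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def interpolate(
    p_lm: Dict[str, float], p_knn: Dict[str, float], lam: float = 0.25
) -> Dict[str, float]:
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    if not p_knn:
        return dict(p_lm)  # nothing retrieved: fall back to the LM alone
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }
```

Because the mixing happens only at the output layer, the input stays short, which is why this family is cheap; it is also why the integration is shallow, since retrieved text never influences the model's hidden states.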

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
model-size-equivalent performance	RETRO with retrieval matched GPT-3-level results	GPT-3	used ~25x fewer parameters with 2T-token DB	reported by RETRO (Section 4.3)	RETRO scaling law: 2 trillion token DB gave GPT-3-comparable performance	Section 4.3

What To Try In 7 Days

Prototype query-based fusion using LangChain + FAISS and an embedding model (e.g., text-embedding-ada-002).

Build a small domain datastore: chunk docs, embed chunks, measure Precision@K, and spot-check a sample of results by hand.

Try two fusion modes, text concatenation for fast results and logits fusion for lower latency/compute, then compare outputs on target tasks (QA or summarization).
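The datastore step above (chunk, embed, retrieve, measure Precision@K) can be sketched end to end with the standard library alone. This is a toy stand-in, not a recommended stack: the bag-of-words `embed` function substitutes for a real embedding model, the brute-force `top_k` substitutes for an ANN index such as FAISS, and the chunk sizes are arbitrary assumptions.

```python
import math
from collections import Counter
from typing import List, Set


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `size` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int) -> List[int]:
    """Brute-force search: indices of the k chunks most similar to query."""
    q = embed(query)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(q, embed(chunks[i])),
        reverse=True,
    )
    return ranked[:k]


def precision_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for i in retrieved[:k] if i in relevant) / k
```

In practice the relevance labels passed to `precision_at_k` would come from the manual spot checks, so even a few dozen labeled queries give a usable signal before any fusion method is compared.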

Agent Features

Memory

external retrieval memory (key-value datastore)

short-term context augmentation

Tool Use

web search

external datastore access

retrieval frameworks (LangChain)

Frameworks

LangChain, LlamaIndex

Architectures

decoder-only LLMs

encoder-decoder models with cross-attention

retrieval-enhanced transformers (RETRO-style)

Optimization Features

Token Efficiency

prompt templates and filtering retrieved contexts

importance weighting and retrieval filtering

Infra Optimization

use SSD-backed key-value stores (LMDB/RocksDB) and vector DBs (FAISS, Milvus)

cache hot retrievals to reduce I/O

Model Optimization

LoRA

distillation/TinyBERT for encoder speedups

quantization for encoder efficiency (cited works)

System Optimization

product quantization (PQ), PCA, LSH for index compression

hardware-aware ANN (HNSW, IVFPQ) and system co-design (PipeRAG, RAGCache)

Training Optimization

asynchronous index refresh to reduce retraining costs

joint retriever-generator training with preselected candidate sets (REALM, RAG, Atlas)

Inference Optimization

FiD-Light and ReFusion for cheaper fusion

logits fusion to avoid long context inputs

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Survey-level work: no new experiments or datasets provided

Performance depends heavily on retrieval quality and index choices

When Not To Use

When strict low latency (<100 ms) is required and you cannot cache retrievals

When no reliable, high-quality external datastore exists

Failure Modes

Bad retrievals (irrelevant or contradictory) cause hallucinations

Stale datastore yields confidently wrong answers

Core Entities

Models

RETRO, kNN-LM, REALM, RAG, Atlas, FiD, Enc-Dec, kNN-MT, ReFusion

Metrics

Precision@K, Recall@K, Accuracy, F1, Exact Match, Perplexity
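For the generation-side metrics, Exact Match and token-level F1 are commonly computed after light answer normalization. The sketch below follows the widely used extractive-QA convention (lowercasing, stripping punctuation and English articles); the exact normalization rules are an assumption here, since individual benchmarks define their own.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact Match is strict and suits short factual answers, while token F1 gives partial credit and is more forgiving of extra or missing words.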

Benchmarks

UDA, ARES, LegalBench-RAG, medical RAG benchmark, Ragbench

Context Entities

Models

BERT, RoBERTa, SimCSE, text-embedding-ada-002, sentence-transformer
