Basic RAG Papers — Parsed & Scored for Practitioners

Practical survey of RAG: paradigms, core components, benchmarks, and engineering gaps

0.70

0.30

0.60

612

RAG lets you keep LLMs current and auditable by fetching external facts at inference time; this reduces hallucinations and speeds updates without retraining the base model.

Key finding

Surveyed RAG work covers a broad task and dataset space.

Numbers: 26 tasks; ~50 datasets

RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

0.40

0.30

0.25

52

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.

Key finding

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0→0.8)
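
The noise-ratio setup can be illustrated with a minimal sketch. It assumes "noise ratio" means the share of supplied documents that do not contain the answer; the helper name `mix_documents` and the toy documents are hypothetical and not taken from the RGB benchmark code.

```python
import random

def mix_documents(relevant, noisy, total, noise_ratio, seed=0):
    """Build a context of `total` documents with the requested share of noise."""
    n_noisy = round(total * noise_ratio)
    n_relevant = total - n_noisy
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    docs = rng.sample(relevant, n_relevant) + rng.sample(noisy, n_noisy)
    rng.shuffle(docs)  # avoid always placing relevant documents first
    return docs

# Hypothetical toy pools of documents.
relevant_docs = ["answer doc A", "answer doc B", "answer doc C"]
noisy_docs = [f"noise doc {i}" for i in range(5)]

context = mix_documents(relevant_docs, noisy_docs, total=5, noise_ratio=0.8)
noise_count = sum(d.startswith("noise") for d in context)
```

At a noise ratio of 0.8, four of five documents are distractors. The reported drop from 96.33% to 76.00% is 20.33 percentage points.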

Practical survey of what makes LLMs factual, how we test it, and how to fix it

0.60

0.40

0.60

52

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.

Key finding

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies

Find and fix contradictions in an LLM's own text without web lookups

0.70

0.60

46

Automate contradiction checks to catch internal hallucinations that retrieval misses, improving trust in long-form outputs and answers at modest extra cost.

Key finding

Self-contradictions are common in open-domain generations.

Numbers: 17.7% of sentences for ChatGPT (MainTestSet)

Top legal AI tools still hallucinate: 17–33% of answers are false or misleading

0.45

0.25

0.30

32

Major legal AI products still produce false or misleading legal claims often enough that lawyers must verify outputs, which affects liability, trust, and the realized efficiency gains.

Key finding

Lexis+ AI provided accurate (correct + grounded) answers for 65% of queries.

Numbers: 65% accuracy (Figure 4; Section 6.1)

Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

0.50

24

Fine-tuning a mid-size Chinese LLM with focused legal instruction data and a small retrieval KB yields measurable gains in legal QA and advice; this reduces manual review and makes legal tools more practical.

Key finding

Large, law-specific SFT dataset built for training.

Numbers: DISC-Law-SFT total size 403K samples

Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

0.30

0.40

0.50

19

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator finetuning.

Key finding

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 → 0.409; F1 0.186 → 0.289
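
To put the two deltas on a common scale, the relative gains can be computed directly from the reported numbers. This is plain arithmetic on the figures above; the helper name `relative_gain` is only illustrative.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

recall_gain = relative_gain(0.361, 0.409)  # baseline vs. RAG recall
f1_gain = relative_gain(0.186, 0.289)      # baseline vs. RAG F1

print(f"Recall: +{recall_gain:.1f}%  F1: +{f1_gain:.1f}%")
```

The F1 gain (about 55%) is much larger in relative terms than the recall gain (about 13%).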

LLM4Vuln + UniVul: separate an LLM's reasoning from retrieved knowledge, context, and prompts to measure real vulnerability-detection skill

0.60

0.70

0.50

17

LLM4Vuln helps teams tell whether an LLM truly reasons about vulnerabilities or just repeats retrieved knowledge; this prevents wasted engineering on unhelpful retrieval and guides model and tool choices for code auditing and triage.

Key finding

Knowledge retrieval helps foundation models on logic-heavy Solidity but not uniformly elsewhere.

Numbers: F1 for traditional foundation models on Solidity nearly doubled on average when retrieved knowledge was added

Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

0.60

0.50

16

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Key finding

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)
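
For readers unfamiliar with the metric, Pass@k can be sketched with the widely used unbiased estimator introduced alongside the HumanEval benchmark. Whether AceCoder computes Pass@1 in exactly this way is an assumption; the function below is a generic illustration, not the paper's code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes all tests.

    n: programs generated for a problem
    c: how many of those n pass every test
    k: sample budget being evaluated
    """
    if n - c < k:
        # Fewer than k failing programs: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

score = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to c / n, the fraction of generated programs that pass.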

ReWOO separates planning from fetching evidence to cut repeating prompt tokens and run smaller models

0.70

0.60

0.80

15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)
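
The idea of planning first and fetching evidence afterwards can be sketched as follows. This is a toy illustration under stated assumptions: the tools, the plan, and the `#E1`-style evidence placeholders are hypothetical stand-ins, and a real system would have an LLM write the plan and a final solver step read the evidence.

```python
import re
from typing import Callable

# Hypothetical toy tools; a real pipeline would call search APIs or an LLM.
TOOLS: dict[str, Callable[[str], str]] = {
    "Lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "Upper": lambda s: s.upper(),
}

# The whole plan is written up front. A step's input may reference earlier
# evidence through a placeholder such as #E1.
PLAN = [
    ("#E1", "Lookup", "capital of France"),
    ("#E2", "Upper", "#E1"),
]

def run_plan(plan, tools):
    """Execute steps in order, substituting earlier evidence into inputs."""
    evidence: dict[str, str] = {}
    for var, tool_name, arg in plan:
        resolved = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg)
        evidence[var] = tools[tool_name](resolved)
    return evidence

evidence = run_plan(PLAN, TOOLS)
```

Because the plan does not depend on intermediate observations, the prompt prefix is not resent after every tool call. The reported figures give 9795.1 / 1986.2 ≈ 4.9, consistent with the "about 5×" claim.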

Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

0.80

0.25

0.65

13

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Key finding

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO: 2T-token DB, performance comparable to GPT-3 with ~25x fewer parameters

FINANCEBENCH: 10,231 open-book financial QA cases to stress-test LLMs

0.30

0.40

0.25

11

Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.

Key finding

FINANCEBENCH contains 10,231 curated QA-evidence triplets.

Numbers: 10,231 cases; 360 documents; 40 companies

Jointly train retriever and medical LLM to improve accuracy, reduce hallucinations, and cut training cost

0.60

0.80

11

Joint retriever+LLM fine-tuning yields better medical QA accuracy and explanations while cutting training compute by orders of magnitude versus large-domain pretraining, making domain-specialized models cheaper and faster to build.

Key finding

JMLR-13B achieves the highest reported average accuracy across evaluated medical QA sets.

Numbers: Avg accuracy 70.5% (JMLR-13B) vs 68.9% (Meditron-70B)

A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

0.60

0.45

0.75

11

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Key finding

CancerLLM achieves state-of-the-art average F1 on diagnosis generation among evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

CrossCodeEval: 10k multilingual examples that force models to read other files to complete code

0.70

0.60

10

Real-world code completions often need context from other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.

Key finding

Off-the-shelf models fail on cross-file examples when only given the current file.

Numbers: StarCoder-15.5B Python EM 8.82% (in-file only)
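
A cross-file retrieval step can be sketched as ranking other files by identifier overlap with the current file. This is a deliberately simple stand-in (Jaccard similarity over identifiers), not the benchmark's retriever; the file names and contents are hypothetical.

```python
import re

def identifiers(code: str) -> set[str]:
    """Extract identifier-like tokens from source code."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets divided by their union (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_cross_file(current: str, other_files: dict[str, str], top_k: int = 1):
    """Return the names of the top_k files most similar to the current file."""
    query = identifiers(current)
    ranked = sorted(
        other_files.items(),
        key=lambda item: jaccard(query, identifiers(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Hypothetical two-file repository.
repo = {
    "billing.py": "def compute_invoice_total(items):\n    return sum(i.price for i in items)\n",
    "auth.py": "def check_password(user, password):\n    return user.hash == password\n",
}
current_file = "from billing import compute_invoice_total\n\ntotal = compute_invoice_total(items)\n"

best = retrieve_cross_file(current_file, repo)
```

The retrieved file's contents would then be prepended to the completion prompt so the model can see the definition it needs.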

A practical survey and benchmark that measures factuality, robustness, fairness, transparency, accountability, and privacy in RAG systems

0.40

0.30

9

RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.

Key finding

Proprietary models outperform most open-source models on trustworthiness metrics.

Numbers: GPT-3.5 factuality=40 vs Llama2-13b-chat=4 (Table 2)

A retrieval+claim-verification pipeline that cuts hallucinations and can be distilled to a fast 7B model

0.80

0.50

0.70

9

Grounding LLM responses in a trusted corpus plus claim-level verification cuts hallucinations dramatically. That reduces misinformation risk, improves user trust, and enables deployment of smaller student models locally for lower cost and better privacy.

Key finding

WikiChat (GPT-4 teacher) achieves high factual accuracy on evaluated conversations.

Numbers: 97.3% factual accuracy (simulated 'All')
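
Claim-level verification can be sketched as splitting a draft into claims and keeping only those supported by retrieved passages. The support check below (every word of the claim must appear in one passage) is a crude, hypothetical stand-in for the LLM-based verifier such a pipeline would use; the corpus and draft are invented for illustration.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_supported(claim: str, passages: list[str]) -> bool:
    """Crude check: all words of the claim occur in a single passage."""
    claim_words = words(claim)
    return bool(claim_words) and any(claim_words <= words(p) for p in passages)

def filter_claims(draft: str, passages: list[str]) -> list[str]:
    """Split a draft into sentence-level claims and drop unsupported ones."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    return [c for c in claims if is_supported(c, passages)]

corpus = ["The Eiffel Tower is located in Paris and was completed in 1889."]
draft = "The Eiffel Tower is located in Paris. The Eiffel Tower was completed in 1925."

kept = filter_claims(draft, corpus)
```

The second sentence is dropped because "1925" does not appear in the trusted passage, so only the grounded claim reaches the final response.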

Survey of practical methods to improve reasoning in large language models

0.60

0.40

0.50

8

Better reasoning reduces wrong conclusions, lowers downstream verification cost, and enables LLMs to be used in higher-stakes workflows like finance, legal, and scientific support.

Key finding

Chain-of-Thought prompting helps multi-step problems by making the model emit intermediate steps.
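
A minimal sketch of a Chain-of-Thought prompt, assuming a single worked exemplar followed by a step-by-step cue. The exemplar text and the function name `build_cot_prompt` are illustrative, not drawn from the survey.

```python
def build_cot_prompt(question: str) -> str:
    """Build a prompt whose exemplar shows intermediate reasoning steps."""
    exemplar = (
        "Q: A shelf holds 3 boxes with 4 books each. How many books are there?\n"
        "A: Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. "
        "The answer is 12.\n"
    )
    # The trailing cue nudges the model to write its reasoning before answering.
    return f"{exemplar}\nQ: {question}\nA: Let's think step by step."

prompt = build_cot_prompt("A train travels 60 km/h for 2.5 hours. How far does it go?")
```

The model's completion is expected to contain intermediate steps before the final answer, which is what makes multi-step errors easier to spot.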

Survey: How to add, update, and use external knowledge with large language models

0.60

0.50

0.60

8

Keeping LLMs accurate preserves user trust and reduces legal risk: use prompt/input edits for cheap, fast fixes, model editing for durable updates, and retrieval for up-to-date answers when models show low confidence.

Key finding

Most knowledge-editing evaluations focus on triple-fact QA benchmarks like ZsRE and CounterFact.

Numbers: ZsRE: 182,282; CounterFact: 21,919

RAGBench: 100k explainable RAG examples plus TRACe — practical metrics to audit retriever+generator systems

0.70

0.55

0.60

8

RAGBench + TRACe gives a unified, explainable way to audit retriever and generator components, reducing costly trial-and-error and surfacing whether errors come from the retriever, the generator, or both.

Key finding

RAGBench totals approximately 100k labeled RAG examples.

Numbers: 100k total; Train 78k / Val 12k / Test 11k

An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

0.50

0.70

0.60

7

FINMEM shows LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories—helpful for trading newer stocks or fast deployment.

Key finding

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)
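
The two backtest metrics can be sketched as below. Conventions vary (compounded versus summed log returns, the risk-free rate, the annualization factor), and the ones used here are assumptions rather than the paper's exact definitions.

```python
import math
import statistics

def cumulative_return(daily_returns: list[float]) -> float:
    """Compound simple daily returns into one total return."""
    total = 1.0
    for r in daily_returns:
        total *= 1.0 + r
    return total - 1.0

def annualized_sharpe(daily_returns: list[float],
                      risk_free_daily: float = 0.0,
                      periods_per_year: int = 252) -> float:
    """Mean excess return over its sample standard deviation, annualized."""
    excess = [r - risk_free_daily for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

# Hypothetical daily returns, for illustration only.
returns = [0.01, 0.03]
total = cumulative_return([0.10, -0.10])
sharpe = annualized_sharpe(returns)
```

Note that a 10% gain followed by a 10% loss compounds to a small net loss, which is why cumulative return and average return should not be confused.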

A small GPT‑2 with recurrent memory reads 11 million tokens and finds facts big LLMs miss

0.60

0.70

0.60

7

When you must locate rare facts across very long documents, memory‑augmented models scale better and are cheaper than relying on huge LLM windows or naive RAG, so consider memory models for long‑document search and auditing.

Key finding

Recurrent memory model processes record-length inputs.

Numbers: Up to ~11M tokens processed (as claimed in the paper)

Large-scale tests show where hallucinations come from, when common fixes help, and when they backfire

0.60

0.50

0.60

7

Hallucinations cause real-world harm (wrong facts, bad decisions). The paper gives practical, tested levers—retrieve relevant docs, apply RLHF, tune instruction mix, and be careful with quantization and aggressive sampling—so teams can reduce factual errors quickly.

Key finding

The GPT-4 based two-step detector (fact extraction + fact judgement) matches human labels at high rates.

Numbers: Agreement 91.5%–94.7% across five domains

CRUD-RAG: a Chinese benchmark testing RAG across Create / Read / Update / Delete tasks

0.70

0.60

0.70

7

CRUD-RAG helps teams tune the full RAG stack (indexing, retriever, prompt, model) for realistic production tasks and trade off accuracy against recall, saving compute and reducing hallucinations.

Key finding

Chunk size strongly changes task behavior.

Numbers: Continuation BLEU 3.42 (chunk size 64) → 5.12 (chunk size 512); RAGQuestEval recall 23.39% → 28.27% (same rows)
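
Chunk size is the knob being varied here. A minimal fixed-size chunker might look like the sketch below; it assumes the size is counted in tokens and uses no overlap, neither of which is confirmed by the summary above.

```python
def chunk_tokens(tokens: list, chunk_size: int) -> list[list]:
    """Split a token list into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# A hypothetical 1,024-token document indexed at the two reported sizes.
document = list(range(1024))
small_chunks = chunk_tokens(document, 64)
large_chunks = chunk_tokens(document, 512)
```

Smaller chunks give the retriever more, narrower candidates; larger chunks carry more surrounding context per hit, which is one plausible reason the continuation task benefits from them.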