RAG helps up to ~10–15 context snippets; model and retriever choice strongly shape results

Overview

Decision Snapshot: Ready For Pilot

The paper delivers directly usable guidance (context size, retriever baseline, domain-specific model choice) backed by experiments on two real datasets, but findings are limited to the evaluated models, metrics, and zero-shot setup.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 40%

Authors

Juraj Vladika, Florian Matthes

Why It Matters For Business

When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This empirical study measures how many retrieved context snippets to give a language model, which retriever to use, and which base model to pick for long-form QA. Using two datasets (BioASQ for biomedical QA and QuoteSum for encyclopedic QA) and eight LLMs, the authors report three findings. Adding context improves answers up to roughly 10–15 snippets, after which quality plateaus or declines. Model choice matters by domain: Mixtral, Mistral, and Qwen lead on biomedical questions, while GPT and LLaMa do better on encyclopedic ones. Open-domain retrieval is much harder than using gold evidence, with BM25 slightly outperforming semantic search on PubMed.

Problem Statement

RAG systems have many moving parts (how much context to pass, which retriever, which reader model). Prior work mostly used short factoid QA and assumed a single gold snippet. We lack systematic guidance for long-form QA where answers must combine multiple snippets.

Main Contribution

Systematic sweep of context size (0,1,3,5,10,15,20,30 snippets) for long-form QA.

Comparative evaluation of two retrievers (BM25 sparse, semantic dense) in closed and open retrieval.
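A context-size sweep of this kind can be sketched as a small harness. This is a minimal illustration, not the authors' code: `answer_fn` and `score_fn` are hypothetical hooks standing in for your own reader-LLM call and your own metric, and only the list of sizes comes from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Context sizes swept in the paper (number of snippets passed to the reader).
CONTEXT_SIZES = (0, 1, 3, 5, 10, 15, 20, 30)


def sweep_context_sizes(
    question: str,
    ranked_snippets: Sequence[str],
    answer_fn: Callable[[str, List[str]], str],
    score_fn: Callable[[str], float],
    sizes: Sequence[int] = CONTEXT_SIZES,
) -> Dict[int, float]:
    """Score the reader's answer at each context size.

    answer_fn and score_fn are hypothetical hooks: plug in your own
    LLM call and your own metric (for example ROUGE-L or NLI entailment).
    """
    results: Dict[int, float] = {}
    for k in sizes:
        context = list(ranked_snippets[:k])  # top-k snippets; k=0 means no context
        answer = answer_fn(question, context)
        results[k] = score_fn(answer)
    return results
```

In practice you would average the per-question scores over an evaluation set before comparing sizes, since a single question is too noisy to show a plateau.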

Key Findings

Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.

Numbers: Mixtral BioASQ entailment: 0→10 snippets 29.4%→50.7% (+21.3pp); open-retrieval stalls after 15–20 snippets.

Practical Use: Supply roughly 10–15 high-quality snippets to a reader LLM; adding many more risks noise and no extra benefit.

Evidence Ref: Tables 1 & 4; Figures and discussion in §5–§6

Model choice shifts best performance by domain: some models excel in biomedicine, others in encyclopedic QA.

Numbers: BioASQ (Ent.% at 10 snippets): Mixtral 50.7% vs GPT-3.5 32.6% and LLaMa 34.5%; QuoteSum (Ent.% at 10): GPT-3.5 44.2% vs.

Practical Use: Pick your reader model to match domain: test Mixtral/Mistral/Qwen for biomedical tasks and GPT/LLaMa for encyclopedic tasks before deployment.

Evidence Ref: Table 1 (BioASQ + QuoteSum) and §5.1
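The deltas quoted in these findings are percentage points (pp), not relative improvements, and the two are easy to confuse. The short sketch below reworks the reported BioASQ entailment figures both ways; the helper names are illustrative only.

```python
def pp_delta(new_pct: float, base_pct: float) -> float:
    """Absolute difference in percentage points."""
    return round(new_pct - base_pct, 1)


def relative_gain(new_pct: float, base_pct: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (new_pct - base_pct) / base_pct, 1)


# Mixtral on BioASQ, 0 -> 10 snippets (figures from the findings above).
print(pp_delta(50.7, 29.4))       # 21.3 pp
print(relative_gain(50.7, 29.4))  # 72.4 % relative to the zero-context score

# Mixtral vs other readers at 10 snippets.
print(pp_delta(50.7, 32.6))       # 18.1 pp over GPT-3.5
print(pp_delta(50.7, 34.5))       # 16.2 pp over LLaMa
```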

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Context-size effect (entailment)	Mixtral BioASQ Ent% 0→10: 29.4%→50.7%	zero-context (0 snippets)	+21.3pp	BioASQ (gold snippets)	Table 1 (Mixtral Ent.% values)	Table 1
Saturation and decline	Open PubMed (Mixtral BM25) Ent% at 10,15,30: 28.9%→31.1%→31.6%	10 snippets	small change, no steady gains beyond 15	BioASQ (open retrieval, PubMed)	Table 4; §6.3	Table 4
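One way to read the saturation row is as gain per added snippet. The sketch below computes that from the open-retrieval figures in the table and reports the context size after which the gain falls under a threshold. The 0.1 pp-per-snippet cutoff is an arbitrary assumption for illustration, not a value from the paper.

```python
from typing import Dict, Optional


def marginal_gains(scores: Dict[int, float]) -> Dict[int, float]:
    """Per-snippet gain (pp) between consecutive context sizes, keyed by the larger size."""
    sizes = sorted(scores)
    return {
        hi: (scores[hi] - scores[lo]) / (hi - lo)
        for lo, hi in zip(sizes, sizes[1:])
    }


def plateau_size(scores: Dict[int, float], min_gain_per_snippet: float = 0.1) -> Optional[int]:
    """Smallest context size after which the per-snippet gain drops below the threshold."""
    sizes = sorted(scores)
    for lo, hi in zip(sizes, sizes[1:]):
        if (scores[hi] - scores[lo]) / (hi - lo) < min_gain_per_snippet:
            return lo
    return None  # no plateau found in the measured range


# Mixtral + BM25, open PubMed retrieval, Ent% at 10, 15 and 30 snippets (table above).
open_retrieval = {10: 28.9, 15: 31.1, 30: 31.6}
print(marginal_gains(open_retrieval))  # about 0.44 pp/snippet up to 15, about 0.03 after
print(plateau_size(open_retrieval))    # 15
```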

What To Try In 7 Days

Run a small A/B: 5 vs 15 snippets on real queries and compare answer quality.

Compare BM25 and your dense retriever on a domain crawl; prefer BM25 if queries are keyword-heavy.

Benchmark 2 reader LLMs (one open, one commercial) on a 100-question slice per domain to pick the best match.
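For the retriever comparison, a BM25 baseline needs very little code. The following is a minimal sketch of Okapi BM25 scoring, assuming whitespace tokenization, the conventional defaults k1=1.5 and b=0.75, and the non-negative IDF variant used by Lucene. A production setup would normally rely on a search engine or an established library rather than this function.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "aspirin reduces fever",
    "the weather is sunny today",
    "aspirin dosage guidance for adults with fever",
]
print(bm25_scores("aspirin fever", docs))  # the weather document scores 0.0
```

Ranking the same queries with this baseline and with your dense retriever, then checking which one surfaces the known-relevant documents, gives a quick first read on whether keyword matching is enough for your domain.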

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jvladika/ContextRAG

Data URLs

https://www.nlm.nih.gov/databases/download/pubmed_medline.html (MEDLINE snapshot)

BioASQ dataset reference (Krithara et al., 2023) as used in paper

QuoteSum dataset reference (Schuster et al., 2024) as used in paper

Risks & Boundaries

Limitations

Only two datasets (BioASQ, QuoteSum); results may not generalize across all domains.

Zero-shot evaluation only; few-shot or fine-tuning could change model rankings.

When Not To Use

When you can provide high-quality few-shot examples or fine-tune the reader: the paper's zero-shot rankings may not carry over.

For domains whose retrieval characteristics differ substantially from PubMed or Wikipedia, unless you re-test the retrievers first.

Failure Modes

Poor retrieval returns irrelevant snippets and degrades answer quality.

Too many snippets cause context saturation and confusion in the reader.
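Both failure modes can be mitigated by filtering and capping what reaches the reader. The sketch below assumes each snippet arrives with a retriever relevance score. The cap of 15 follows the paper's guidance, while `min_score` is a hypothetical threshold you would need to tune for your own retriever.

```python
from typing import List, Tuple


def select_context(
    scored_snippets: List[Tuple[str, float]],
    max_snippets: int = 15,
    min_score: float = 0.0,
) -> List[str]:
    """Drop low-relevance snippets, then keep at most max_snippets of the best ones."""
    kept = [(text, score) for text, score in scored_snippets if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return [text for text, _ in kept[:max_snippets]]


candidates = [("relevant A", 0.9), ("off-topic", 0.1), ("relevant B", 0.7)]
print(select_context(candidates, max_snippets=2, min_score=0.5))
# ['relevant A', 'relevant B']
```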

Core Entities

Models

GPT-3.5 (Turbo-0125), GPT-4o (Turbo-0513), Mixtral (8x7B), LLaMa 3 (70B), Mistral-7B, Gemma (7B), LLaMa 3 (8B), Qwen 1.5 (7B)

Metrics

ROUGE-L, BERTScore, NLI entailment (Ent%), METEOR, Cosine similarity (embedding)
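Of these metrics, ROUGE-L is the simplest to reproduce by hand: it is based on the longest common subsequence (LCS) between a candidate answer and a reference. The sketch below is a simplified version, assuming whitespace tokenization, lowercasing, and a balanced F-measure. Standard implementations add stemming and other normalisation, so scores will not match them exactly.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# LCS is "the cat on the mat" (5 tokens) out of 6 on each side.
print(round(rouge_l_f1("the cat lay on the mat", "the cat sat on the mat"), 3))  # 0.833
```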

Datasets

BioASQ-QA (Task 10b summary questions), QuoteSum, PubMed, MEDLINE (2012–2022 subset), Wikipedia (via API)
