RAG helps up to ~10–15 context snippets; model and retriever choice strongly shape results

February 20, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper delivers directly usable guidance (context size, retriever baseline, domain-specific model choice) backed by experiments on two real datasets, but findings are limited to the evaluated models, metrics, and zero-shot setup.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 40%

Authors

Juraj Vladika, Florian Matthes

Links

Abstract / PDF / Code / Data

Why It Matters For Business

When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.

Who Should Care

Summary TLDR

This empirical study measures how many retrieved context snippets to give a language model, which retriever to use, and which base model to pick for long-form QA. Using two datasets (BioASQ biomedical and QuoteSum encyclopedic) and eight LLMs, the authors find: adding context improves answers up to roughly 10–15 snippets, then plateaus or declines; model choice matters by domain (Mixtral/Mistral/Qwen beat others on biomedical; GPT/LLaMa better on encyclopedic); and open-domain retrieval is much harder than using gold evidence, with BM25 slightly outperforming semantic search on PubMed.

Problem Statement

RAG systems have many moving parts (how much context to pass, which retriever, which reader model). Prior work mostly used short factoid QA and assumed a single gold snippet. We lack systematic guidance for long-form QA where answers must combine multiple snippets.

Main Contribution

Systematic sweep of context size (0,1,3,5,10,15,20,30 snippets) for long-form QA.

Comparative evaluation of two retrievers (BM25 sparse, semantic dense) in closed and open retrieval.
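
A context-size sweep like the one listed above can be sketched as a loop over top-k slices of a ranked snippet list. This is a minimal illustration, not the authors' code: `sweep_context_sizes`, `toy_reader`, and `toy_metric` are hypothetical names, and the reader and metric are toy stand-ins for a real LLM call and a real answer-quality metric.

```python
from typing import Callable, Dict, Sequence

# Context sizes taken from the sweep described above.
CONTEXT_SIZES = (0, 1, 3, 5, 10, 15, 20, 30)


def sweep_context_sizes(
    question: str,
    ranked_snippets: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], str],
    score_fn: Callable[[str], float],
    sizes: Sequence[int] = CONTEXT_SIZES,
) -> Dict[int, float]:
    """Score the reader's answer for each top-k slice of the ranked snippets."""
    results = {}
    for k in sizes:
        context = ranked_snippets[:k]  # k = 0 gives the zero-context baseline
        answer = answer_fn(question, context)
        results[k] = score_fn(answer)
    return results


# Toy stand-in (hypothetical): a "reader" that just concatenates its context.
def toy_reader(question: str, context: Sequence[str]) -> str:
    return " ".join(context)


# Toy stand-in (hypothetical): fraction of reference facts present in the answer.
def toy_metric(answer: str, reference_facts=("fact-a", "fact-b")) -> float:
    return sum(fact in answer for fact in reference_facts) / len(reference_facts)


snippets = ["fact-a", "noise", "fact-b"] + ["noise"] * 27
scores = sweep_context_sizes("toy question", snippets, toy_reader, toy_metric)
print(scores)
```

In a real run, `answer_fn` would wrap the reader model and `score_fn` would compare the answer against a reference; the toy versions only show the shape of the loop.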

Key Findings

Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.

Numbers: Mixtral BioASQ entailment, 0→10 snippets: 29.4%→50.7% (+21.3pp); open retrieval stalls after 15–20 snippets.

Practical Use: Supply roughly 10–15 high-quality snippets to a reader LLM; adding many more risks noise with no extra benefit.

Evidence Ref: Tables 1 & 4; figures and discussion in §5–§6

Model choice shifts best performance by domain: some models excel in biomedicine, others in encyclopedic QA.

Numbers: BioASQ (Ent.% at 10 snippets): Mixtral 50.7% vs. GPT-3.5 32.6% and LLaMa 34.5%; QuoteSum (Ent.% at 10): GPT-3.5 44.2% vs.

Practical Use: Pick your reader model to match domain: test Mixtral/Mistral/Qwen for biomedical tasks and GPT/LLaMa for encyclopedic tasks before deployment.

Evidence Ref: Table 1 (BioASQ + QuoteSum) and §5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Context-size effect (entailment) | Mixtral BioASQ Ent%, 0→10 snippets: 29.4%→50.7% | zero-context (0 snippets) | +21.3pp | BioASQ (gold snippets) | Table 1 (Mixtral Ent.% values) | Table 1
Saturation and decline | Open PubMed (Mixtral BM25) Ent% at 10, 15, 30 snippets: 28.9%, 31.1%, 31.6% | 10 snippets | small change, no steady gains beyond 15 | BioASQ (open retrieval, PubMed) | Table 4; §6.3 | Table 4
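
The deltas reported above are plain percentage-point arithmetic. The sketch below recomputes them and adds a per-snippet marginal gain to make the plateau visible; the helper names `pp_delta` and `gain_per_snippet` are illustrative assumptions, and the inputs are the Mixtral values quoted in the results.

```python
def pp_delta(baseline_pct: float, value_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


def gain_per_snippet(points):
    """Marginal gain (pp per added snippet) between consecutive (k, score) points."""
    gains = []
    for (k0, s0), (k1, s1) in zip(points, points[1:]):
        gains.append((k0, k1, round((s1 - s0) / (k1 - k0), 2)))
    return gains


# Mixtral on BioASQ with gold snippets, 0 vs. 10 snippets.
print(pp_delta(29.4, 50.7))  # 21.3

# Mixtral + BM25 on open PubMed retrieval at 10, 15, and 30 snippets.
open_retrieval = [(10, 28.9), (15, 31.1), (30, 31.6)]
print(gain_per_snippet(open_retrieval))  # [(10, 15, 0.44), (15, 30, 0.03)]
```

Going from 0 to 10 gold snippets averages about 2.1pp per snippet, while going from 15 to 30 retrieved snippets adds about 0.03pp per snippet, which is the plateau in numeric form.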

What To Try In 7 Days

Run a small A/B: 5 vs 15 snippets on real queries and compare answer quality.

Compare BM25 and your dense retriever on a domain crawl; prefer BM25 if queries are keyword-heavy.

Benchmark 2 reader LLMs (one open, one commercial) on a 100-question slice per domain to pick the best match.
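
For the retriever comparison, a BM25 baseline is small enough to write by hand. The following is a minimal sketch of Okapi BM25 over an in-memory corpus, not the retriever used in the paper. The class name, whitespace tokenizer, toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokenization; a real pipeline would do more.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b  # assumed values; tune on your own data
        self.n_docs = len(docs)
        self.term_freqs = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_len) / self.n_docs
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter(term for tf in self.term_freqs for term in tf)

    def idf(self, term):
        n = self.doc_freq.get(term, 0)
        # The "1 +" inside the log keeps IDF non-negative for very common terms.
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query, doc_index):
        tf = self.term_freqs[doc_index]
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - self.b + self.b * self.doc_len[doc_index] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def top_k(self, query, k):
        ranked = sorted(
            range(self.n_docs), key=lambda i: self.score(query, i), reverse=True
        )
        return ranked[:k]


# Toy corpus (made up for illustration).
docs = [
    "aspirin reduces headache pain",
    "statins lower cholesterol levels",
    "aspirin and cholesterol in older adults",
]
bm25 = BM25(docs)
top_two = bm25.top_k("aspirin headache", k=2)
print(top_two)
```

If a dense retriever exposes the same `top_k(query, k)` shape, both can be scored on the same queries, for example by checking how often a known-relevant document appears in the top k.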

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://www.nlm.nih.gov/databases/download/pubmed_medline.html (MEDLINE snapshot)

BioASQ dataset reference (Krithara et al., 2023) as used in paper

QuoteSum dataset reference (Schuster et al., 2024) as used in paper

Risks & Boundaries

Limitations

Only two datasets (BioASQ, QuoteSum); results may not generalize across all domains.

Zero-shot evaluation only; few-shot or fine-tuning could change model rankings.

When Not To Use

When you can provide high-quality few-shot examples or fine-tune readers (the zero-shot results may not carry over).

For domains whose retrieval characteristics differ sharply from PubMed/Wikipedia, unless you re-test the retrievers first.

Failure Modes

Poor retrieval returns irrelevant snippets and degrades answer quality.

Too many snippets cause context saturation and confusion in the reader.

Core Entities

Models

GPT-3.5 (Turbo-0125), GPT-4o (Turbo-0513), Mixtral (8x7B), LLaMa 3 (70B), Mistral-7B, Gemma (7B), LLaMa 3 (8B), Qwen 1.5 (7B)

Metrics

ROUGE-L, BERTScore, NLI entailment (Ent%), METEOR, Cosine similarity (embedding)
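
Of the metrics listed, ROUGE-L is the simplest to reproduce by hand: it scores the longest common subsequence (LCS) of tokens shared by a candidate and a reference. The sketch below computes a balanced F1 with whitespace tokenization; both are assumptions, and published implementations differ in tokenization, stemming, and how precision and recall are weighted, so scores will not match a given library exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
print(rouge_l_f1("the cat sat on the mat", reference))  # identical text scores 1.0
print(rouge_l_f1("the cat sat", reference))  # precision 1.0, recall 0.5
```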

Datasets

BioASQ-QA (Task 10b summary questions), QuoteSum, PubMed, MEDLINE (2012–2022 subset), Wikipedia (via API)