RAG helps up to ~10–15 context snippets; model and retriever choice strongly shape results

February 20, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.45

Citation Count

1

Authors

Juraj Vladika, Florian Matthes

Links

Abstract / PDF

Why It Matters For Business

When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.

Summary TLDR

This empirical study measures how many retrieved context snippets to give a language model, which retriever to use, and which base model to pick for long-form QA. Using two datasets (BioASQ biomedical and QuoteSum encyclopedic) and eight LLMs, the authors find: adding context improves answers up to roughly 10–15 snippets, then plateaus or declines; model choice matters by domain (Mixtral/Mistral/Qwen beat others on biomedical; GPT/LLaMa better on encyclopedic); and open-domain retrieval is much harder than using gold evidence, with BM25 slightly outperforming semantic search on PubMed.

Problem Statement

RAG systems have many moving parts (how much context to pass, which retriever, which reader model). Prior work mostly used short factoid QA and assumed a single gold snippet. We lack systematic guidance for long-form QA where answers must combine multiple snippets.

Main Contribution

Systematic sweep of context size (0, 1, 3, 5, 10, 15, 20, 30 snippets) for long-form QA.

Comparative evaluation of two retrievers (BM25 sparse, semantic dense) in closed and open retrieval.

Benchmark of eight LLMs (open and commercial) across biomedical (BioASQ) and encyclopedic (QuoteSum) long-form QA.
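A context-size sweep like the one listed above can be sketched as a small harness. This is a minimal illustration, not the authors' code: `build_prompt`, `sweep_context_sizes`, the prompt template, and the `generate` and `score` callables are all hypothetical stand-ins for your own reader model and metric. Only the list of snippet counts comes from the study.

```python
from typing import Callable, Dict, Sequence

# Snippet counts taken from the sweep described above.
CONTEXT_SIZES = [0, 1, 3, 5, 10, 15, 20, 30]


def build_prompt(question: str, snippets: Sequence[str]) -> str:
    """Format a question and its snippets into one prompt (hypothetical template)."""
    if not snippets:
        # Zero-context baseline: the model answers from internal knowledge only.
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def sweep_context_sizes(
    question: str,
    ranked_snippets: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str], float],
    sizes: Sequence[int] = CONTEXT_SIZES,
) -> Dict[int, float]:
    """Score the answer produced with the top-k snippets, for each k in sizes."""
    results: Dict[int, float] = {}
    for k in sizes:
        # Slicing past the end is safe: k larger than the pool uses all snippets.
        prompt = build_prompt(question, ranked_snippets[:k])
        results[k] = score(generate(prompt))
    return results
```

In practice `generate` would wrap a call to the reader LLM and `score` would compare the answer against a reference, averaged over many questions rather than one.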

Key Findings

Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.

Numbers: Mixtral BioASQ entailment, 0→10 snippets: 29.4%→50.7% (+21.3pp); open-retrieval stalls after 15–20 snippets.

Model choice shifts best performance by domain: some models excel in biomedicine, others in encyclopedic QA.

Numbers: BioASQ (Ent.% at 10 snippets): Mixtral 50.7% vs GPT-3.5 32.6% and LLaMa 34.5%; QuoteSum (Ent.% at 10 snippets): GPT-3.5 44.2% vs Mixtral 35.9%.

Open-domain retrieval (searching millions of docs) yields much lower QA scores than using gold snippets.

Numbers: Open PubMed Ent.% often ~20–30% vs gold-snippet settings where top models reach ~50% (Mixtral).

Sparse BM25 retrieval slightly outperformed semantic dense retrieval on PubMed for final QA.

Numbers: GPT-4o Ent.% at 10 snippets: semantic 19.9% vs BM25 21.5%; Mixtral: semantic 27.6% vs BM25 28.9%.
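For readers unfamiliar with sparse retrieval, here is a minimal sketch of BM25 scoring. It is a textbook Okapi BM25 with whitespace tokenization and the common defaults `k1=1.5`, `b=0.75`; these are assumptions for illustration, not the retriever configuration used in the study. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification; real systems do more)."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(toks) / avg_len)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            total += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(total)
    return scores


def top_k(query: str, docs: List[str], k: int) -> List[str]:
    """Return the k highest-scoring documents, best first."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:k]]
```

Because BM25 rewards exact term overlap, it tends to do well when queries contain the same technical vocabulary as the documents, which is one plausible reading of the PubMed result above.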

LLM internal knowledge can beat poorly retrieved context; bad retrieval can hurt answers.

Numbers: Examples and discussion show zero-context (internal) answers sometimes scored higher than RAG answers with 1–10 weakly‑f

Results

Context-size effect (entailment)

Value: Mixtral BioASQ Ent% 0→10: 29.4%→50.7%

Baseline: zero-context (0 snippets)

Saturation and decline

Value: Open PubMed (Mixtral BM25) Ent% at 10, 15, 30: 28.9%→31.1%→31.6%

Baseline: 10 snippets
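One way to make "saturation" concrete is to look at the gain per added snippet between consecutive context sizes. The sketch below uses the open PubMed numbers quoted above; the function name and the 0.1-point threshold are hypothetical choices, not values from the study.

```python
from typing import Dict


def plateau_point(scores: Dict[int, float], min_gain_per_snippet: float = 0.1) -> int:
    """Return the smallest context size after which the per-snippet gain drops below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    sizes = sorted(scores)
    for prev, nxt in zip(sizes, sizes[1:]):
        gain = (scores[nxt] - scores[prev]) / (nxt - prev)
        if gain < min_gain_per_snippet:
            return prev
    # Gains never fell below the threshold: the largest size tested is still paying off.
    return sizes[-1]


# Ent% for Mixtral with BM25 on open PubMed, as reported above.
mixtral_bm25 = {10: 28.9, 15: 31.1, 30: 31.6}
print(plateau_point(mixtral_bm25))
```

Going from 10 to 15 snippets adds about 0.44 points per snippet, while going from 15 to 30 adds about 0.03 points per snippet, so under this threshold the plateau starts at 15.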

Retriever comparison

Value: GPT-4o Ent% at 10: semantic 19.9% vs BM25 21.5%

Baseline: semantic retrieval

Domain model gap

Value: BioASQ Ent% at 10: Mixtral 50.7% vs GPT-3.5 32.6%; QuoteSum Ent% at 10: GPT-3.5 44.2% vs Mixtral 35.9%

Baseline: other models in same table

What To Try In 7 Days

Run a small A/B: 5 vs 15 snippets on real queries and compare answer quality.

Compare BM25 and your dense retriever on a domain crawl; prefer BM25 if queries are keyword-heavy.

Benchmark 2 reader LLMs (one open, one commercial) on a 100-question slice per domain to pick the best match.
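For the 5 vs 15 snippet A/B, a paired comparison on the same queries is usually more informative than comparing two averages. The following is a minimal sketch of a paired bootstrap, assuming you already have one quality score per query for each arm; the function name, resample count, and seed are illustrative choices.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference B minus A, 95% CI low, 95% CI high) over paired queries."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Each difference compares the two arms on the same query.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    resampled = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), low, high
```

If the interval excludes zero, the difference between the two arms is unlikely to be noise on this query sample. With only a few dozen queries the interval will be wide, so treat it as a sanity check rather than a verdict.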

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only two datasets (BioASQ, QuoteSum); results may not generalize across all domains.
  • Zero-shot evaluation only; few-shot or fine-tuning could change model rankings.
  • Automated metrics (ROUGE, BERTScore, NLI) have known blind spots; no human evaluation was performed.
  • The model and retriever landscape evolves quickly; the experiments reflect a mid‑2024 snapshot.

When Not To Use

  • When you can provide high-quality few-shot examples or fine-tune readers (the zero-shot results may mislead).
  • For domains with very different retrieval characteristics than PubMed/Wikipedia without re-testing retrievers.
  • If you need human evaluation of answer correctness; only automated metrics were used.

Failure Modes

  • Poor retrieval returns irrelevant snippets and degrades answer quality.
  • Too many snippets cause context saturation and confusion in the reader.
  • Conflicts between LLM internal knowledge and retrieved context yield inconsistent answers.

Core Entities

Models

  • GPT-3.5 (Turbo-0125)
  • GPT-4o (0513)
  • Mixtral (8x7B)
  • LLaMa 3 (70B)
  • Mistral-7B
  • Gemma (7B)
  • LLaMa 3 (8B)
  • Qwen 1.5 (7B)

Metrics

  • ROUGE-L
  • BERTScore
  • NLI entailment (Ent%)
  • METEOR
  • Cosine similarity (embedding)
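To make two of these metrics concrete, here is a simplified sketch of ROUGE-L (as an F1 over the longest common subsequence) and cosine similarity. It assumes whitespace tokenization with no stemming and takes embedding vectors as plain lists, so its numbers will not match an official ROUGE scorer or any particular embedding model exactly.

```python
import math
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated answer and a reference answer."""
    cand: List[str] = candidate.lower().split()
    ref: List[str] = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

ROUGE-L rewards word overlap in the same order, while embedding cosine similarity can credit paraphrases. Neither checks factual support, which is why the study also reports NLI entailment.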

Datasets

  • BioASQ-QA (Task 10b summary questions)
  • QuoteSum
  • PubMed
  • MEDLINE (2012–2022 subset)
  • Wikipedia (via API)