Overview
The paper delivers directly usable guidance (context size, retriever baseline, domain-specific model choice) backed by experiments on two real datasets, but findings are limited to the evaluated models, metrics, and zero-shot setup.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.
Summary TLDR
This empirical study measures how many retrieved context snippets to give a language model, which retriever to use, and which base model to pick for long-form QA. Using two datasets (BioASQ for biomedical QA and QuoteSum for encyclopedic QA) and eight LLMs, the authors report three findings. First, adding context improves answers up to roughly 10–15 snippets, after which quality plateaus or declines. Second, the best model depends on the domain: Mixtral, Mistral, and Qwen beat the others on biomedical questions, while GPT and LLaMa do better on encyclopedic ones. Third, open-domain retrieval is much harder than answering from gold evidence, with BM25 slightly outperforming semantic search on PubMed.
Problem Statement
RAG systems have many moving parts (how much context to pass, which retriever, which reader model). Prior work mostly used short factoid QA and assumed a single gold snippet. We lack systematic guidance for long-form QA where answers must combine multiple snippets.
Main Contribution
Systematic sweep of context size (0, 1, 3, 5, 10, 15, 20, 30 snippets) for long-form QA.
Comparative evaluation of two retrievers (BM25 sparse, semantic dense) in closed and open retrieval.
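
A sweep of this kind can be sketched as a simple loop over context sizes. This is a minimal illustration under stated assumptions, not the authors' harness: `answer_fn` and `score_fn` are hypothetical stand-ins for a reader LLM call and an answer-quality metric, the prompt layout is assumed, and the toy stubs exist only so the loop runs end to end.

```python
from typing import Callable, Dict, Sequence

# Context sizes listed in the study's sweep.
CONTEXT_SIZES = (0, 1, 3, 5, 10, 15, 20, 30)


def build_prompt(question: str, snippets: Sequence[str]) -> str:
    """Place numbered snippets ahead of the question (assumed prompt layout)."""
    if not snippets:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def sweep_context_sizes(
    question: str,
    ranked_snippets: Sequence[str],
    reference: str,
    answer_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    sizes: Sequence[int] = CONTEXT_SIZES,
) -> Dict[int, float]:
    """Score the reader's answer for each context size k, using the top-k snippets."""
    results: Dict[int, float] = {}
    for k in sizes:
        prompt = build_prompt(question, ranked_snippets[:k])
        results[k] = score_fn(answer_fn(prompt), reference)
    return results


# Hypothetical stand-ins: replace with a real LLM call and a real metric.
def echo_reader(prompt: str) -> str:
    """Toy 'reader' that returns its prompt unchanged."""
    return prompt


def token_recall(answer: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the answer."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0


if __name__ == "__main__":
    snippets = ["aspirin inhibits cox", "ibuprofen is an nsaid"]
    scores = sweep_context_sizes(
        "Which drugs are mentioned?", snippets, "aspirin ibuprofen",
        echo_reader, token_recall,
    )
    for k, s in scores.items():
        print(k, s)
```

When fewer than k snippets are available, slicing simply returns all of them, so larger sizes repeat the same score rather than failing.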
Key Findings
Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.
Model choice shifts best performance by domain: some models excel in biomedicine, others in encyclopedic QA.
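
One way to turn a score curve into a concrete snippet budget is to take the smallest context size that comes within a tolerance of the best score. The sketch below is an illustrative heuristic, not the paper's procedure; the function name `plateau_point` and the 1.0 percentage-point tolerance are assumptions. The example values are the open-retrieval Ent% figures reported in the Results table.

```python
from typing import Mapping


def plateau_point(scores: Mapping[int, float], tolerance: float = 1.0) -> int:
    """Return the smallest context size whose score is within `tolerance` of the best."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    return min(k for k, s in scores.items() if s >= best - tolerance)


# Ent% for Mixtral with BM25 on open PubMed retrieval (Results table, Table 4).
open_pubmed_ent = {10: 28.9, 15: 31.1, 30: 31.6}

if __name__ == "__main__":
    print(plateau_point(open_pubmed_ent))       # 15
    print(plateau_point(open_pubmed_ent, 3.0))  # 10
```

A looser tolerance trades a little answer quality for a smaller, cheaper context.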
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Context-size effect (entailment) | Mixtral BioASQ Ent% 0→10: 29.4%→50.7% | zero-context (0 snippets) | +21.3pp | BioASQ (gold snippets) | Table 1 (Mixtral Ent.% values) | Table 1 |
| Saturation and decline | Open PubMed (Mixtral BM25) Ent% at 10, 15, 30: 28.9%→31.1%→31.6% | 10 snippets | +2.2pp (10→15), +0.5pp (15→30); no steady gains beyond 15 | BioASQ (open retrieval, PubMed) | Table 4; §6.3 | Table 4 |
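
The deltas in this table are percentage points (pp), which is a plain difference of two percentages, not a relative change. The short sketch below recomputes them from the reported values; the helper names `pp_delta` and `relative_change` are illustrative.

```python
def pp_delta(baseline_pct: float, value_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


def relative_change(baseline_pct: float, value_pct: float) -> float:
    """Relative change as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (value_pct - baseline_pct) / baseline_pct, 1)


if __name__ == "__main__":
    # Mixtral on BioASQ gold snippets: 0 snippets vs 10 snippets.
    print(pp_delta(29.4, 50.7))         # 21.3
    print(relative_change(29.4, 50.7))  # 72.4
    # Mixtral + BM25 on open PubMed: 10 -> 15 -> 30 snippets.
    print(pp_delta(28.9, 31.1))         # 2.2
    print(pp_delta(31.1, 31.6))         # 0.5
```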
What To Try In 7 Days
Run a small A/B: 5 vs 15 snippets on real queries and compare answer quality.
Compare BM25 and your dense retriever on a domain crawl; prefer BM25 if queries are keyword-heavy.
Benchmark 2 reader LLMs (one open, one commercial) on a 100-question slice per domain to pick the best match.
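
For the retriever comparison, a BM25 baseline is small enough to write by hand. The sketch below is a minimal, dependency-free Okapi BM25 scorer, offered as an illustration and not as the retriever used in the paper. The whitespace tokenizer, the `k1=1.5` and `b=0.75` defaults, and the smoothed IDF variant are all assumptions; a production setup would normally use an established search library.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (simplifying assumption: no stemming)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory list of documents."""

    def __init__(self, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        self.avgdl = sum(self.doc_len) / len(docs) if docs else 0.0
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.doc_tf for term in tf)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def score(self, query: str, idx: int) -> float:
        """BM25 score of document `idx` for `query`."""
        tf = self.doc_tf[idx]
        length_ratio = self.doc_len[idx] / self.avgdl if self.avgdl else 0.0
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 5) -> List[Tuple[int, float]]:
        """Return up to k (doc_index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tf))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    corpus = [
        "bm25 is a sparse keyword retrieval function",
        "dense retrievers embed queries and passages",
        "pubmed indexes biomedical abstracts",
    ]
    for idx, s in BM25(corpus).top_k("sparse keyword retrieval", k=2):
        print(idx, round(s, 3), corpus[idx])
```

To compare against a dense retriever, run both on the same queries and feed each ranked list to the same reader and metric, so that only the retriever changes.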
Risks & Boundaries
Limitations
Only two datasets (BioASQ, QuoteSum); results may not generalize across all domains.
Zero-shot evaluation only; few-shot or fine-tuning could change model rankings.
When Not To Use
When you can supply high-quality few-shot examples or fine-tune the reader: the study's zero-shot rankings may not hold.
For domains whose retrieval characteristics differ substantially from PubMed or Wikipedia, unless you re-test the retrievers first.
Failure Modes
Poor retrieval returns irrelevant snippets and degrades answer quality.
Too many snippets cause context saturation and confusion in the reader.

