Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.45
Citation Count
1
Why It Matters For Business
When building a RAG product, supplying about 10–15 curated snippets tends to give the best return: beyond that, extra context adds cost and can add noise. Both the retriever and the reader model need to be chosen and tuned for your domain.
Summary TLDR
This empirical study examines how many retrieved context snippets to give a language model, which retriever to use, and which base model to pick for long-form QA. Using two datasets (BioASQ for biomedical QA and QuoteSum for encyclopedic QA) and eight LLMs, the authors find that adding context improves answers up to roughly 10–15 snippets, after which quality plateaus or declines; that the best model depends on the domain (Mixtral, Mistral, and Qwen beat the others on biomedical questions, while GPT and LLaMa do better on encyclopedic ones); and that open-domain retrieval is much harder than answering from gold evidence, with BM25 slightly outperforming semantic search on PubMed.
Problem Statement
RAG systems have many moving parts: how much context to pass, which retriever to use, and which reader model to pick. Prior work mostly used short factoid QA and assumed a single gold snippet. Systematic guidance is lacking for long-form QA, where answers must combine multiple snippets.
Main Contribution
Systematic sweep of context size (0, 1, 3, 5, 10, 15, 20, and 30 snippets) for long-form QA.
Comparative evaluation of two retrievers (BM25 sparse, semantic dense) in closed and open retrieval.
Benchmark of eight LLMs (open and commercial) across biomedical (BioASQ) and encyclopedic (QuoteSum) long-form QA.
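The context-size sweep described above can be sketched as a small harness. This is a minimal illustration, not the authors' code: `build_prompt`, `sweep_context_sizes`, the prompt template, and the `answer_fn` / `score_fn` callables are all hypothetical names introduced here, and each example is assumed to carry its snippets already ranked best-first.

```python
from typing import Callable, Dict, List, Sequence

# Context sizes reported in the study's sweep.
CONTEXT_SIZES = (0, 1, 3, 5, 10, 15, 20, 30)


def build_prompt(question: str, snippets: Sequence[str]) -> str:
    """Hypothetical prompt template; the paper's exact wording is not reproduced."""
    if not snippets:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def sweep_context_sizes(
    examples: List[dict],
    answer_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    sizes: Sequence[int] = CONTEXT_SIZES,
) -> Dict[int, float]:
    """Return the mean score for each context size.

    Each example is a dict with 'question', 'reference', and ranked 'snippets'.
    answer_fn wraps the reader LLM; score_fn compares an answer to a reference.
    """
    results: Dict[int, float] = {}
    for k in sizes:
        scores = []
        for ex in examples:
            # Keep only the top-k snippets (k = 0 means no retrieved context).
            prompt = build_prompt(ex["question"], ex["snippets"][:k])
            scores.append(score_fn(answer_fn(prompt), ex["reference"]))
        results[k] = sum(scores) / len(scores) if scores else 0.0
    return results
```

Plotting the returned dictionary against `k` is one way to see where scores flatten or drop on your own data.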
Key Findings
Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.
Model choice shifts best performance by domain: some models excel in biomedicine, others in encyclopedic QA.
Open-domain retrieval (searching millions of docs) yields much lower QA scores than using gold snippets.
Sparse BM25 retrieval slightly outperformed semantic dense retrieval on PubMed for final QA.
An LLM's internal knowledge alone can beat answers built on poorly retrieved context; bad retrieval can actively hurt answer quality.
Results
Context-size effect (entailment)
Saturation and decline
Retriever comparison
Domain model gap
Who Should Care
What To Try In 7 Days
Run a small A/B test: 5 vs. 15 snippets on real queries, then compare answer quality.
Compare BM25 and your dense retriever on a domain crawl; prefer BM25 if queries are keyword-heavy.
Benchmark 2 reader LLMs (one open, one commercial) on a 100-question slice per domain to pick the best match.
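For the retriever comparison, a rough starting point is a pure-Python Okapi BM25 ranker plus a recall@k helper that works on any ranking, including one produced by a dense retriever. This is a sketch under stated assumptions: the whitespace tokenizer, the `k1` and `b` defaults, and the function names are illustrative choices made here, not the paper's configuration.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer; a real setup would use a proper analyzer."""
    return text.lower().split()


def bm25_rank(query: str, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices sorted by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

Feeding the rankings from both retrievers into `recall_at_k` on the same labeled queries gives a like-for-like comparison before any reader model is involved.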
Reproducibility
Data Urls
- https://www.nlm.nih.gov/databases/download/pubmed_medline.html (MEDLINE snapshot)
- BioASQ dataset reference (Krithara et al., 2023) as used in paper
- QuoteSum dataset reference (Schuster et al., 2024) as used in paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only two datasets (BioASQ, QuoteSum); results may not generalize across all domains.
- Zero-shot evaluation only; few-shot or fine-tuning could change model rankings.
- Automated metrics (ROUGE, BERTScore, NLI) have known blind spots; no human evaluation was performed.
- The model and retriever landscape evolves quickly; the experiments reflect a mid-2024 snapshot.
When Not To Use
- When you can provide high-quality few-shot examples or fine-tune the reader; the study's zero-shot rankings may not carry over.
- For domains whose retrieval characteristics differ sharply from PubMed or Wikipedia, unless you re-test the retrievers first.
- When you need human-validated answer correctness; the study relied on automated metrics only.
Failure Modes
- Poor retrieval returns irrelevant snippets and degrades answer quality.
- Too many snippets cause context saturation and confusion in the reader.
- Conflicts between LLM internal knowledge and retrieved context yield inconsistent answers.
Core Entities
Models
- GPT-3.5 (Turbo-0125)
- GPT-4o (2024-05-13)
- Mixtral (8x7B)
- LLaMa 3 (70B)
- Mistral-7B
- Gemma (7B)
- LLaMa 3 (8B)
- Qwen 1.5 (7B)
Metrics
- ROUGE-L
- BERTScore
- NLI entailment (Ent%)
- METEOR
- Cosine similarity (embedding)
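Two of the metrics listed above are simple enough to sketch directly: ROUGE-L, based on the longest common subsequence (LCS) of tokens, and cosine similarity between embedding vectors. The version below is an illustrative approximation, assuming whitespace tokenization and the balanced F1 form of ROUGE-L; published packages differ in tokenization and stemming, so their scores will not match exactly. Function names are chosen here for illustration.

```python
import math
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

For example, the candidate "the cat sat" against the reference "the cat sat on the mat" has an LCS of 3 tokens, giving precision 1.0, recall 0.5, and an F1 of about 0.67.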
Datasets
- BioASQ-QA (Task 10b summary questions)
- QuoteSum
- PubMed
- MEDLINE (2012–2022 subset)
- Wikipedia (via API)

