Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Choose RAG or LC based on model size, document length, and task. This reduces cost and error: RAG protects smaller models from long-context failure and reduces hallucinations; LC is better for synthesis and reasoning on strong long-context models.
Summary TLDR
LaRA is a 2,326-case benchmark that compares Retrieval-Augmented Generation (RAG) and long-context (LC) LLMs across novels, academic papers, and financial statements for four QA tasks (location, reasoning, comparison, hallucination detection). There is no universal winner: small/weak models often benefit more from RAG, large models with strong long-context ability often do better with LC, RAG is best at hallucination detection, and LC wins at comparison and reasoning for strong models. Use task, model size, context length, and chunking to route queries in practice.
Problem Statement
As LLMs get much longer context windows, practitioners must choose between (A) feeding the full long text to an LLM (LC) or (B) using retrieval to inject selected chunks (RAG). Existing comparisons are inconclusive and suffer from benchmark flaws (short contexts, leakage, truncation). LaRA aims to fix those flaws and give practical guidance on when to use RAG vs LC.
Main Contribution
LaRA benchmark: 2,326 test cases, 3 context types, 4 QA tasks reflecting real queries
Systematic comparison of 11 LLMs (7 open-source, 4 proprietary) across 32k and 128k context windows
Controlled experiments on chunking, chunk size/quantity, and answer position (lost-in-the-middle)
Open-sourced code and dataset to enable reproducible RAG vs LC evaluations
Key Findings
No universal winner: the best choice depends on model size, context length, task, and chunking configuration.
LaRA dataset: 2,326 test cases spanning novels, papers, financial statements and four QA types.
Aggregate: at 32k tokens LC is on average 2.40% more accurate than RAG across models; at 128k RAG is on average 3.68% more accurate than LC.
RAG helps weaker models much more: examples include +6.48% and +38.12% accuracy gains (128k) for two smaller models.
Task differences: comparison tasks strongly favor LC (average LC-minus-RAG gap ≈ 14–15%); hallucination detection strongly favors RAG (average gap −10% to −22%).
LC models suffer from the 'lost in the middle' effect: accuracy drops when the answer sits near the center of the context; RAG is robust to answer position.
Chunking matters: for a large model (72B), retrieving more chunks steadily helps; for a small model (7B), performance peaks and then degrades as extra chunks add noise.
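The answer-position finding can be checked on your own data by bucketing accuracy by where the answer sits in the context. The following is a minimal sketch, not the benchmark's own analysis code: the record format (a relative position in [0, 1] plus a correctness flag), the function name, and the default of five bins are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def accuracy_by_position(
    records: Iterable[Tuple[float, bool]], n_bins: int = 5
) -> List[Optional[float]]:
    """Return accuracy per position bin; None marks a bin with no samples.

    Each record is (relative_position, correct), where relative_position is
    the answer's offset divided by the context length (hypothetical format).
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    totals = [0] * n_bins
    hits = [0] * n_bins
    for position, correct in records:
        if not 0.0 <= position <= 1.0:
            raise ValueError("relative position must lie in [0, 1]")
        # A position of exactly 1.0 falls into the last bin.
        index = min(int(position * n_bins), n_bins - 1)
        totals[index] += 1
        hits[index] += int(correct)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

A dip in the middle bins for LC runs, with a flat profile for RAG runs, would be consistent with the lost-in-the-middle pattern described above.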
Results
Dataset size
Average LC vs RAG gap (32k)
Average LC vs RAG gap (128k)
Comparison task gap
Hallucination detection gap
Model-specific gains for small models (examples)
Judge reliability
Who Should Care
What To Try In 7 Days
Run a small split test: route comparison/reasoning queries to your largest LC-capable model and hallucination-sensitive queries to RAG.
If using small models, prototype RAG with 600-token chunks and a 100-token overlap, then tune the number of retrieved chunks; measure accuracy and cost.
Validate an LLM-based judge (e.g., GPT-4o) against human labels on 100 samples and compute Cohen's Kappa; if agreement is high, use the judge to scale evaluation cheaply.
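For the chunking prototype, a sliding window over tokens is one simple way to get 600-token chunks with a 100-token overlap. This is a minimal sketch under stated assumptions: it operates on an already tokenized list, and the function name is hypothetical. Which tokenizer the benchmark used is not stated here, so any tokenizer you plug in is your own choice.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 600, overlap: int = 100
) -> List[List[str]]:
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks
```

With the defaults, a 1,300-token document yields windows starting at tokens 0, 500, and 1000, so consecutive chunks share 100 tokens.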
Optimization Features
Token Efficiency
- Use RAG to limit tokens passed to the model
- Accuracy
System Optimization
- Hybrid retrieval (embeddings + BM25) for better chunk recall
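One common way to combine an embedding ranking with a BM25 ranking is reciprocal rank fusion (RRF). The summary does not say how the two signals should be merged, so RRF is a swapped-in technique here, and the function name, the constant `k=60`, and the chunk IDs are illustrative assumptions. The sketch takes ranked lists of chunk IDs that your own retrievers have already produced.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 5
) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one list of top_n IDs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Higher-ranked chunks contribute more; k damps the top ranks.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]
```

Usage: pass the dense ranking and the BM25 ranking as two lists, for example `reciprocal_rank_fusion([dense_ids, bm25_ids], top_n=5)`, and feed the fused IDs to the generator.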
Inference Optimization
- Route queries by task and model size
- Segment long documents and use voting to avoid truncation
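Routing by task and model size can start as a small rule table. The sketch below encodes the tendencies reported in this summary; the task labels, the `strong_long_context` flag, and the 32,000-token fallback threshold are assumptions chosen for illustration and should be validated on your own traffic.

```python
from dataclasses import dataclass

# Task labels are hypothetical names for the benchmark's QA categories.
LC_FAVORED_TASKS = {"comparison", "reasoning"}
RAG_FAVORED_TASKS = {"hallucination_detection"}


@dataclass(frozen=True)
class ModelProfile:
    name: str
    strong_long_context: bool  # set by your own evaluation, not by this code


def choose_strategy(task: str, model: ModelProfile, context_tokens: int) -> str:
    """Return "rag" or "lc" for a query, using simple heuristic rules."""
    if task in RAG_FAVORED_TASKS:
        return "rag"  # RAG is reported as stronger at hallucination detection
    if not model.strong_long_context:
        return "rag"  # weaker models are reported to gain more from RAG
    if task in LC_FAVORED_TASKS:
        return "lc"  # strong LC models are reported to win on these tasks
    # Fallback (assumed threshold): LC led on average at 32k, RAG at 128k.
    return "lc" if context_tokens <= 32_000 else "rag"
```

Keeping the rules in one function makes it easy to log each decision and compare accuracy and cost per route during a split test.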
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Contexts limited to novels, academic papers, and US financial statements; other domains may behave differently.
- RAG config is fixed (600-token chunks, 100-token overlap, 5 chunks/doc); other configs might change results.
- LLM judge (GPT-4o) may introduce bias despite high agreement with humans.
- Proprietary model access and exact prompting/settings differ, limiting exact reproducibility for some results.
When Not To Use
- If your application uses very different document types (e.g., codebases, logs) and you have not validated on similar data.
- If you cannot control retrieval quality (no reliable embeddings or index), RAG may fail to retrieve needed chunks.
- If you require guaranteed, interpretable retrieval chains, which this benchmark does not produce.
Failure Modes
- LC hallucination when full context adds noisy, irrelevant information.
- RAG misses required chunks for comparison tasks if retrieval does not surface all relevant pieces.
- Small models suffer from lost-in-the-middle when fed full long contexts.
- Excessive retrieved chunks can introduce noise and reduce accuracy for weaker models.
Core Entities
Models
- Llama-3.2-3B-Instruct
- Llama-3.1-8B-Instruct
- Llama-3.3-70B-Instruct
- Llama-3.3-70B-Instruct-Q8
- Qwen-2.5-7B-Instruct
- Qwen-2.5-72B-Instruct
- Mistral-Nemo-12B
- GPT-4o
- GPT-4o-mini
- Claude-3.5-sonnet
- Gemini-1.5-pro
Metrics
- Accuracy
- Avg GAP (LC minus RAG)
- Cohen's Kappa
- F1 / EM (not used as primary)
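Two of the metrics above are easy to compute by hand. The sketch below shows the average gap as the mean of per-model LC accuracy minus RAG accuracy, and Cohen's Kappa in its standard two-rater form, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. Function names and input formats are assumptions made for illustration, not the benchmark's code.

```python
from collections import Counter
from typing import Hashable, Sequence


def average_gap(lc_accuracy: Sequence[float], rag_accuracy: Sequence[float]) -> float:
    """Mean of (LC accuracy - RAG accuracy); positive values favor LC."""
    if len(lc_accuracy) != len(rag_accuracy) or not lc_accuracy:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [lc - rag for lc, rag in zip(lc_accuracy, rag_accuracy)]
    return sum(diffs) / len(diffs)


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)
```

For judge validation, `rater_a` would hold the human verdicts and `rater_b` the LLM judge's verdicts on the same samples.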
Datasets
- LaRA (this paper)
- novels (Gutenberg txt)
- financial statements (annual/quarterly reports 2024)
- concatenated academic papers (arXiv 2024)
Benchmarks
- ∞-bench (InfiniteBench)
- LongBench
- ZeroSCROLLS
- LongBench-V2
- Loong
- LONG2RAG

