LaRA: when to use retrieval vs feeding the full long context

February 14, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng

Why It Matters For Business

Choose between retrieval-augmented generation (RAG) and full long-context input (LC) based on model size, document length, and task. Matching the approach to the workload reduces cost and error: RAG protects smaller models from long-context failure and reduces hallucinations, while LC is better for synthesis and reasoning on strong long-context models.

Summary TLDR

LaRA is a 2,326-case benchmark that compares Retrieval-Augmented Generation (RAG) and long-context (LC) LLMs across novels, academic papers, and financial statements for four QA tasks (location, reasoning, comparison, hallucination detection). There is no universal winner: small/weak models often benefit more from RAG, large models with strong long-context ability often do better with LC, RAG is best at hallucination detection, and LC wins at comparison and reasoning for strong models. Use task, model size, context length, and chunking to route queries in practice.

Problem Statement

As LLMs get much longer context windows, practitioners must choose between (A) feeding the full long text to an LLM (LC) or (B) using retrieval to inject selected chunks (RAG). Existing comparisons are inconclusive and suffer from benchmark flaws (short contexts, leakage, truncation). LaRA aims to fix those flaws and give practical guidance on when to use RAG vs LC.

Main Contribution

LaRA benchmark: 2,326 test cases, 3 context types, 4 QA tasks reflecting real queries

Systematic comparison of 11 LLMs (7 open-source, 4 proprietary) across 32k and 128k context windows

Controlled experiments on chunking, chunk size/quantity, and answer position (lost-in-the-middle)

Open-sourced code and dataset to enable reproducible RAG vs LC evaluations

Key Findings

No universal winner: the best choice depends on model size, context length, task, and chunking setup.

LaRA dataset: 2,326 test cases spanning novels, papers, financial statements and four QA types.

Numbers: 2,326 test cases; Table 3

Aggregate: at 32k tokens LC is on average 2.40% more accurate than RAG across models; at 128k RAG is on average 3.68% more accurate than LC.

Numbers: LC − RAG = +2.40% (32k); LC − RAG = −3.68% (128k)

RAG helps weaker models much more: examples include +6.48% and +38.12% accuracy gains (128k) for two smaller models.

Numbers: +6.48% (Llama-3.2-3B), +38.12% (Mistral-Nemo-12B) at 128k

Task differences: comparison tasks strongly favor LC (average LC − RAG gap ≈ +14% to +15%); hallucination detection strongly favors RAG (average LC − RAG gap ≈ −10% to −22%).

Numbers: Comparison gap ≈ +15.22% (32k) and +14.30% (128k); Hallucination gap = −10.38% (32k), −22.36% (128k)

LC models suffer from 'lost in the middle': accuracy drops when the answer sits near the center of the context, while RAG is robust to answer position.

Numbers: Cohen's analysis shows position-correlated drop (plots in Section D); explicit declines in weaker models like Qwen-2.5-7

Chunking matters: for a large model (72B) more retrieved chunks steadily help; for a small model (7B) performance peaks then degrades from noise.

Numbers: Figure 2 results on Qwen-2.5-72B vs Qwen-2.5-7B
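
The LC − RAG gaps quoted in the findings above can be reproduced on your own evaluation runs with a few lines of Python. This is a minimal sketch that assumes the average gap is the unweighted mean of per-model accuracy differences; the function name `average_gap` and the sample accuracies are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_gap(lc_acc, rag_acc):
    """Return the mean of (LC accuracy - RAG accuracy) over shared models.

    Both arguments map a model name to an accuracy in percent.
    A positive result favors LC, a negative result favors RAG.
    """
    shared = sorted(set(lc_acc) & set(rag_acc))
    if not shared:
        raise ValueError("no models in common between the two result sets")
    return mean(lc_acc[m] - rag_acc[m] for m in shared)


# Made-up accuracies for illustration only (NOT numbers from the benchmark).
lc_results = {"model_a": 60.0, "model_b": 70.0}
rag_results = {"model_a": 64.0, "model_b": 68.0}

gap = average_gap(lc_results, rag_results)
print(f"Avg GAP (LC - RAG): {gap:+.2f}")
```

Keeping the sign convention fixed (LC minus RAG) makes results directly comparable with the figures reported here.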

Results

Dataset size

Value: 2,326 test cases across 3 context types and 4 tasks

Average LC vs RAG gap (32k)

Value: LC +2.40% accuracy

Baseline: RAG

Average LC vs RAG gap (128k)

Value: RAG +3.68% accuracy (i.e., LC −3.68%)

Baseline: RAG

Comparison task gap

Value: LC +15.22% (32k) and +14.30% (128k)

Baseline: RAG

Hallucination detection gap

Value: LC −10.38% (32k) and −22.36% (128k), i.e., a RAG advantage

Baseline: RAG

Model-specific gains for small models (examples)

Value: Llama-3.2-3B: +6.48% (128k); Mistral-Nemo-12B: +38.12% (128k)

Baseline: LC vs RAG

Judge reliability

Value: Cohen's Kappa ≈ 0.90–0.98 between GPT-4o judge and humans

Baseline: human evaluation
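
To check judge reliability on your own data, Cohen's Kappa can be computed directly from two parallel label lists. The sketch below is a plain-Python implementation of the standard formula, kappa = (p_o − p_e) / (1 − p_e); the function name and the toy labels are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[k] * count_b[k] for k in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (1 = answer judged correct, 0 = incorrect), for illustration only.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 9 of 10 items and chance agreement is 0.5, so Kappa comes out at 0.8.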

Who Should Care

What To Try In 7 Days

Run a small split test: route comparison/reasoning queries to your largest LC-capable model and hallucination-sensitive queries to RAG.

If using small models, prototype RAG with 600-token chunks, overlap 100, and tune number of chunks; measure accuracy and cost.

Validate an LLM-based judge (e.g., GPT-4o) against human labels on 100 samples and compute Cohen's Kappa; if agreement is high, use the judge to scale evaluation cheaply.
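
For the chunking prototype suggested above, a sliding-window splitter is enough to get started. This is a minimal sketch, assuming the overlap of 100 is measured in tokens and that the text has already been tokenized; the function name `chunk_tokens` is hypothetical, and a real setup would use the tokenizer of the embedding or generation model.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for tokens here; this is only a placeholder document.
doc_tokens = list(range(1300))
chunks = chunk_tokens(doc_tokens)
print([len(c) for c in chunks])
```

The number of chunks passed to the model is then a separate knob to tune per model, since the findings above suggest smaller models degrade when too many chunks are retrieved.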

Optimization Features

Token Efficiency

  • Use RAG to limit tokens passed to the model
  • Accuracy

System Optimization

  • Hybrid retrieval (embeddings + BM25) for better chunk recall
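
One common way to combine an embedding ranking with a BM25 ranking is reciprocal rank fusion (RRF). The source does not say which fusion method was used, so RRF is a substituted technique here, shown only as a sketch; the function name, the chunk IDs, and the constant `k=60` (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings from a dense retriever and from BM25.
dense_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

RRF needs only ranks, not comparable scores, which is convenient because embedding similarities and BM25 scores live on different scales.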

Inference Optimization

  • Route queries by task and model size
  • Segment long documents and use voting to avoid truncation
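
Routing by task and model strength can start as a simple rule table derived from the findings reported here. The sketch below is an illustrative assumption, not the paper's method: the `Task` enum, the `choose_strategy` function, and the boolean `strong_long_context_model` flag are hypothetical, and deciding which models count as "strong" is left to your own evaluation.

```python
from enum import Enum


class Task(Enum):
    LOCATION = "location"
    REASONING = "reasoning"
    COMPARISON = "comparison"
    HALLUCINATION = "hallucination_detection"


def choose_strategy(task, strong_long_context_model):
    """Return "rag" or "lc" for a query, using coarse rules of thumb."""
    # Hallucination detection favored RAG in the reported results.
    if task is Task.HALLUCINATION:
        return "rag"
    # Smaller or weaker models benefited more from RAG.
    if not strong_long_context_model:
        return "rag"
    # Strong long-context models did better with the full context,
    # most clearly on comparison and reasoning tasks.
    return "lc"


print(choose_strategy(Task.COMPARISON, strong_long_context_model=True))
print(choose_strategy(Task.COMPARISON, strong_long_context_model=False))
```

Rules like these are a starting point for a split test; the thresholds should be replaced by measurements on your own documents and models.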

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Contexts limited to novels, academic papers, and US financial statements; other domains may behave differently.
  • RAG config fixed (600-token chunks, overlap 100, 5 chunks/doc) — other configs might change results.
  • LLM judge (GPT-4o) may introduce bias despite high agreement with humans.
  • Proprietary model access and exact prompting/settings differ, limiting exact reproducibility for some results.

When Not To Use

  • If your application uses very different document types (e.g., codebases, logs) without validation on similar data.
  • If you cannot control retrieval quality (no reliable embeddings or index), RAG may fail to retrieve needed chunks.
  • If you require guaranteed interpretable retrieval chains that this benchmark does not produce.

Failure Modes

  • LC hallucination when full context adds noisy, irrelevant information.
  • RAG misses required chunks for comparison tasks if retrieval does not surface all relevant pieces.
  • Small models suffer from lost-in-the-middle when fed full long contexts.
  • Excessive retrieved chunks can introduce noise and reduce accuracy for weaker models.

Core Entities

Models

  • Llama-3.2-3B-Instruct
  • Llama-3.1-8B-Instruct
  • Llama-3.3-70B-Instruct
  • Llama-3.3-70B-Instruct-Q8
  • Qwen-2.5-7B-Instruct
  • Qwen-2.5-72B-Instruct
  • Mistral-Nemo-12B
  • GPT-4o
  • GPT-4o-mini
  • Claude-3.5-sonnet
  • Gemini-1.5-pro

Metrics

  • Accuracy
  • Avg GAP (LC minus RAG)
  • Cohen's Kappa
  • F1 / EM (not used as primary)

Datasets

  • LaRA (this paper)
  • novels (Gutenberg txt)
  • financial statements (annual/quarterly reports 2024)
  • concatenated academic papers (arXiv 2024)

Benchmarks

  • ∞-bench (InfiniteBench)
  • LongBench
  • ZeroSCROLLS
  • LongBench-V2
  • Loong
  • LONG2RAG