Overview
The benchmark is practical and public, and the experiments show consistent RAG gains; however, task coverage and embedding choices are limited and answers are withheld to protect evaluation integrity.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/8
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you publish or productize long-document QA, pair the model with embedding-based retrieval: it gives consistent accuracy gains over naively feeding very long text to the model, and the retrieved passages help make outputs traceable.
Summary TLDR
The authors release Marathon, a multiple-choice benchmark focused on long-context understanding and reasoning. It combines samples from LongBench and LooGLE, spans contexts from ~2K to >200K characters, and contains 1,530 test items across six tasks (e.g., comprehension, timeline reorder, computation). They evaluate 10 open-source LLMs plus ChatGPT and GPT‑4, and compare three optimization styles: prompt compression (LongLLMLingua) and two RAG setups using OpenAI and Jina embeddings. Main results: retrieval-based RAG (especially Jina) raises average accuracy by ~11–12 percentage points versus the vanilla baseline; prompt compression gives little or mixed gains. Timeline reorder and numeric/…
Problem Statement
Existing long-context benchmarks use free-form metrics (F1/BLEU/ROUGE) and short samples, which mis-score valid model outputs and fail to test comprehension across very long documents. We need a compact, reliable benchmark that (1) forces selection among vetted alternatives, (2) covers much longer contexts, and (3) enables fair comparison of long-context optimization methods.
Main Contribution
Marathon: a new multiple-choice benchmark for long-context tasks (1,530 items; contexts up to >200K chars) covering six task types.
A head-to-head evaluation of 10 open-source LLMs plus ChatGPT and GPT‑4 across vanilla, prompt-compression (LongLLMLingua), and two RAG setups (OpenAI/Jina embeddings).
Key Findings
Embedding-based RAG improves average accuracy over the vanilla baseline; for example, OpenAI-embedding RAG averages 50.46% versus 41.76% for vanilla (+8.70 pp, Table 4).
Prompt compression (LongLLMLingua) gives little net gain and can hurt some models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 41.76% | — | — | Marathon (all tasks) | Table 4 (Vanilla Avg.) | Table 4 |
| Accuracy | 50.46% | Vanilla 41.76% | +8.70 pp | Marathon (all tasks) | Table 4 (OpenAI Embedding RAG Avg.) | Table 4 |
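The delta in the second row is an absolute difference in percentage points, not a relative change. A minimal sketch of the arithmetic, using only the two averages quoted from Table 4 (the function names are illustrative and not from the paper):

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 2)


# Averages as reported in the Results table (Table 4).
vanilla_avg = 41.76
openai_rag_avg = 50.46

print("delta (pp):", delta_pp(openai_rag_avg, vanilla_avg))
print("relative gain (%):", relative_gain(openai_rag_avg, vanilla_avg))
```

Reporting both figures side by side avoids confusing a gain of a few percentage points with a much larger relative improvement over a low baseline.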
What To Try In 7 Days
Run a Jina-embedding RAG pipeline on your long-doc QA use case and compare accuracy to your current approach.
Add a strict output-format validator (JSON schema) and a small post-processor to fix malformed model outputs.
Benchmark your most common long-context tasks (timeline ordering, numeric inference) and flag them for special handling.
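The first suggestion above can be prototyped before committing to an embedding provider. The sketch below shows the general chunk, embed, rank, and prompt flow of embedding-based retrieval. It is not the authors' pipeline: `toy_embed` is a placeholder that counts tokens and stands in for a real embedding model such as the Jina or OpenAI embeddings named in the summary, and the chunk sizes, function names, and prompt wording are all assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text: str) -> Counter:
    """Placeholder embedding: lowercase token counts.

    Assumption: replace this with a real embedding model for actual experiments.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, document: str, k: int = 3) -> List[str]:
    """Return the k chunks most similar to the question."""
    query_vec = toy_embed(question)
    chunks = chunk_words(document)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, options: List[str], passages: List[str]) -> str:
    """Assemble a multiple-choice prompt from the retrieved passages."""
    context = "\n\n".join(passages)
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer with one letter."
    )
```

To compare against a current approach, run the same questions once with the full document in the prompt and once with `build_prompt(question, options, retrieve(question, document))`, then score both with the same accuracy metric.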
Risks & Boundaries
Limitations
Context lengths are not uniformly distributed; examples >200K chars are rare.
Passage Retrieval items were sampled from LongBench and are shorter than other tasks.
When Not To Use
When you need exact training labels or full answer keys (answers are not released).
When you require an even distribution of context lengths, which the current dataset does not provide (the authors plan to expand it).
Failure Modes
Models produce long, extraneous text and ignore strict output formats, which breaks downstream parsers.
Prompt compression can drop needed details and sometimes reduce accuracy.
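The first failure mode can often be contained with a tolerant parser in front of the scorer. The sketch below is one possible approach, not something described in the paper. It assumes the model is asked to reply with a JSON object of the form `{"answer": "A"}` and that each item has four lettered options; both are hypothetical conventions chosen for illustration. The hand-written check could be replaced with a JSON Schema validator.

```python
import json
import re
from typing import Optional

# Assumption: four lettered options per item.
VALID_OPTIONS = {"A", "B", "C", "D"}


def parse_answer(raw: str) -> Optional[str]:
    """Return the chosen option letter, or None if it cannot be recovered."""
    # 1. Strict path: the whole output is valid JSON.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        obj = None

    # 2. Repair path: pull the first {...} span out of chatty output.
    if obj is None:
        match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
        if match:
            try:
                obj = json.loads(match.group(0))
            except json.JSONDecodeError:
                obj = None

    if isinstance(obj, dict):
        answer = str(obj.get("answer", "")).strip().upper()
        if answer in VALID_OPTIONS:
            return answer

    # 3. Last resort: a phrase such as "Answer: B" or "The answer is B".
    match = re.search(r"(?i:\banswer\b)\W*(?:(?i:is)\W*)?([A-D])\b", raw)
    return match.group(1) if match else None


print(parse_answer('{"answer": "C"}'))
print(parse_answer('Sure! Here is my reply:\n{"answer": "b"}\nLet me explain.'))
print(parse_answer("I am not sure."))
```

Returning `None` instead of guessing keeps unparseable outputs visible, so they can be counted as format failures rather than silently scored as wrong answers.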

