Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
If you publish or productize long-document QA, use retrieval with document embeddings rather than naively feeding very long text to the model. In this benchmark it gives consistent accuracy gains, and it helps make outputs traceable.
Summary TLDR
The authors release Marathon, a multiple-choice benchmark focused on long-context understanding and reasoning. It combines samples from LongBench and LooGLE, spans contexts from ~2K to >200K characters, and contains 1,530 test items across six tasks (e.g., comprehension, timeline reorder, computation). They evaluate 10 open-source LLMs plus ChatGPT and GPT‑4, and compare three optimization methods: prompt compression (LongLLMLingua) and two RAG setups using OpenAI and Jina embeddings. Main results: embedding-based RAG (especially with Jina embeddings) raises average accuracy by ~11–12 percentage points versus the vanilla baseline; prompt compression gives little or mixed gains. Timeline reorder and computation tasks remain the weakest across models.
Problem Statement
Existing long-context benchmarks use free-form metrics (F1/BLEU/ROUGE) and short samples, which mis-score valid model outputs and fail to test comprehension across very long documents. We need a compact, reliable benchmark that (1) forces selection among vetted alternatives, (2) covers much longer contexts, and (3) enables fair comparison of long-context optimization methods.
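To illustrate the mis-scoring concern, the sketch below compares a SQuAD-style token-level F1 with exact option matching. It is a minimal illustration under stated assumptions, not the scoring code of Marathon or of any benchmark named here; the function names and example strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a free-form answer and a reference string."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def choice_correct(predicted_option: str, gold_option: str) -> bool:
    """Multiple-choice scoring: the selected option letter must match."""
    return predicted_option.strip().upper() == gold_option.strip().upper()


# Hypothetical example: a correct paraphrase is penalized by token F1,
# while option matching is unambiguous.
reference = "the treaty was signed in 1648"
prediction = "it was signed in 1648"
print(round(token_f1(prediction, reference), 3))  # 0.727
print(choice_correct(" b", "B"))                  # True
```

The free-form answer is factually right but scores below 1.0, which is the kind of mismatch a multiple-choice format avoids.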
Main Contribution
Marathon: a new multiple-choice benchmark for long-context tasks (1,530 items; contexts up to >200K chars) covering six task types.
A head-to-head evaluation of 10 open-source LLMs plus ChatGPT and GPT‑4 across vanilla, prompt-compression (LongLLMLingua), and two RAG setups (OpenAI/Jina embeddings).
Empirical finding: embedding-based RAG improves QA accuracy more than prompt compression; timeline reorder and computation remain hard.
Key Findings
Embedding-based RAG improves average accuracy versus vanilla baseline.
Prompt compression (LongLLMLingua) gives little net gain and can hurt some models.
Timeline reorder and computation tasks are the weakest across models.
Top closed models outperform open-source models by a large margin.
Many open-source models struggle to follow strict output-format instructions.
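A per-task breakdown is one way to see which task types lag on your own data. The sketch below is a minimal example, assuming you hold your own gold labels (Marathon's answers are withheld); the record layout, function names, threshold, and sample rows are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """records: iterable of (task, predicted_option, gold_option) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted == gold:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def weakest_tasks(per_task, threshold):
    """Return task names whose accuracy falls below the threshold."""
    return sorted(task for task, acc in per_task.items() if acc < threshold)


# Made-up rows for illustration only.
records = [
    ("comprehension", "A", "A"),
    ("comprehension", "B", "B"),
    ("timeline_reorder", "C", "A"),
    ("timeline_reorder", "D", "D"),
    ("computation", "B", "C"),
]
per_task = accuracy_by_task(records)
print(per_task)
print(weakest_tasks(per_task, threshold=0.6))
```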
Results
[Results table: values not recovered. Surviving labels: Accuracy; Task averages (Vanilla) for Timeline Reorder and Computation; Instruction-following (JSON), example values (Vanilla).]
Who Should Care
What To Try In 7 Days
Run a Jina-embedding RAG pipeline on your long-doc QA use case and compare accuracy to your current approach.
Add a strict output-format validator (JSON schema) and a small post-processor to fix malformed model outputs.
Benchmark your most common long-context tasks (timeline ordering, numeric inference) and flag them for special handling.
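The output-format validator and post-processor suggested above could look roughly like this. It is a minimal sketch using only the Python standard library; the `{"answer": "<letter>"}` output shape, the set of allowed option letters, and the function name are assumptions for illustration, not the format Marathon prescribes.

```python
import json
import re

# Assumption: answers are single option letters; adjust to your own task.
ALLOWED_OPTIONS = {"A", "B", "C", "D"}


def extract_answer(raw: str):
    """Return a validated option letter, or None if nothing usable is found.

    Tries a strict parse of the whole output first, then falls back to the
    first flat {...} span embedded in surrounding chatter.
    """
    # Non-greedy match: handles flat objects only, not nested ones.
    candidates = [raw] + re.findall(r"\{.*?\}", raw, flags=re.DOTALL)
    for candidate in candidates:
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if not isinstance(obj, dict):
            continue
        answer = obj.get("answer")
        if isinstance(answer, str) and answer.strip().upper() in ALLOWED_OPTIONS:
            return answer.strip().upper()
    return None


print(extract_answer('Sure! Here is my choice: {"answer": "b"} Hope this helps.'))
print(extract_answer('{"answer": "E"}'))
print(extract_answer("I think the second option is right."))
```

Returning `None` instead of guessing keeps malformed outputs visible, so they can be counted or retried rather than silently scored.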
Optimization Features
Token Efficiency
- context chunking to 12,000-token segments for distractor generation
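Fixed-size chunking of this kind can be sketched as follows. This is a minimal example, not the authors' pipeline: whitespace splitting stands in for a real tokenizer (an assumption), and the function name is hypothetical; only the 12,000-token segment size comes from the text above.

```python
def chunk_tokens(tokens, max_tokens=12000):
    """Split a token list into consecutive segments of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Assumption: whitespace tokens approximate model tokens for illustration.
document = "word " * 25000
tokens = document.split()
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [12000, 12000, 1000]
```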
Infra Optimization
- evaluation runs used 4x A100 80GB; model GPU allocation varies by size
Inference Optimization
- prompt compression (LongLLMLingua) evaluated
- retrieval-augmented generation (embedding-based RAG) evaluated
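The retrieval step of an embedding-based RAG setup can be sketched as below. This is a toy illustration only: `toy_embed` is a bag-of-words stand-in for a real embedding model such as the OpenAI or Jina embeddings mentioned above, and all function names and the prompt layout are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks, k: int = 2):
    """Return the k chunks most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, options, chunks, k: int = 2) -> str:
    """Assemble a short prompt from retrieved chunks instead of the full text."""
    context = "\n".join(retrieve(question, chunks, k))
    choices = "\n".join(f"{label}. {text}" for label, text in options)
    return f"Context:\n{context}\n\nQuestion: {question}\n{choices}"


chunks = [
    "the ship left port in march",
    "budget figures for the year",
    "the ship arrived in june",
]
print(retrieve("when did the ship arrive", chunks, k=1))
```

Because only the retrieved chunks reach the model, each answer can be traced back to a small set of source passages.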
Reproducibility
Data Urls
- https://github.com/Hambaobao/Marathon (questions and contexts; answers withheld)
- Source datasets: LongBench, LooGLE (both MIT-licensed as noted)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Context lengths are not uniformly distributed; examples >200K chars are rare.
- Passage Retrieval items were sampled from LongBench and are shorter than other tasks.
- RAG evaluation only used OpenAI and Jina embeddings; other embedding systems were not tested due to cost/time.
- The benchmark publishes contexts and questions but not correct answers; online submission is required for scoring.
When Not To Use
- When you need exact training labels or full answer keys (answers are not released).
- If you require an even distribution of context lengths, which the current dataset lacks (the authors plan to expand it).
- For evaluating multimodal long-context capabilities (no multimodal items included).
Failure Modes
- Models produce long, extraneous text and ignore strict output-format instructions, breaking downstream parsers.
- Prompt compression can drop needed details and sometimes reduce accuracy.
- RAG improvements are uneven: some tasks (timeline, computation) remain poorly solved even after retrieval.
Core Entities
Models
- GPT-4
- ChatGPT
- Yi-34B
- Beluga-70B
- Tulu2-70B
- Qwen-14B
- Mistral-7B
- Zephyr-7B
- ChatGLM3-6B
- Alfred-40B
- Longchat-13B
- StripedHyena-7B
Metrics
- Accuracy
- JSON instruction-following rate
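Both metrics reduce to simple ratios. The sketch below assumes the JSON instruction-following rate means the fraction of raw outputs that parse strictly as a JSON object; that definition is an assumption, and the paper's exact criterion may differ. Function names and sample data are hypothetical.

```python
import json


def accuracy(predicted, gold):
    """Fraction of items whose predicted option equals the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def json_follow_rate(raw_outputs):
    """Fraction of raw outputs that parse, with no cleanup, as a JSON object."""
    if not raw_outputs:
        return 0.0
    valid = 0
    for raw in raw_outputs:
        try:
            if isinstance(json.loads(raw), dict):
                valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(raw_outputs)


# Made-up data for illustration only.
print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
print(json_follow_rate(['{"answer": "A"}', 'The answer is {"answer": "A"}']))
```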
Datasets
- LongBench
- LooGLE
Benchmarks
- Marathon

