Marathon: a multiple-choice benchmark that stresses LLMs with very long documents (up to ~260K chars)

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and public, and the experiments show consistent RAG gains; however, task coverage and embedding choices are limited and answers are withheld to protect evaluation integrity.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Lei Zhang, Yunshui Li, Ziqiang Liu, Jiaxi Yang, Junhao Liu, Longze Chen, Run Luo, Min Yang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you publish or productize long-document QA, use retrieval with document embeddings — it gives consistent accuracy gains over naively feeding very long text and helps make outputs traceable.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The authors release Marathon, a multiple-choice benchmark focused on long-context understanding and reasoning. It combines samples from LongBench and LooGLE, spans contexts from ~2K to >200K characters, and contains 1,530 test items across six tasks (e.g., comprehension, timeline reorder, computation). They evaluate 10 open-source LLMs plus ChatGPT and GPT‑4, and compare three optimization methods: prompt compression (LongLLMLingua) and two RAG setups using OpenAI and Jina embeddings. Main results: embedding-based RAG raises average accuracy over the vanilla baseline by ~9 percentage points with OpenAI embeddings and ~11 with Jina embeddings; prompt compression gives little or mixed gains. Timeline reorder and numeric/‘

Problem Statement

Existing long-context benchmarks rely on free-form metrics (F1/BLEU/ROUGE), which mis-score valid model outputs, and on short samples, which fail to test comprehension across very long documents. We need a compact, reliable benchmark that (1) forces selection among vetted alternatives, (2) covers much longer contexts, and (3) enables fair comparison of long-context optimization methods.

Main Contribution

Marathon: a new multiple-choice benchmark for long-context tasks (1,530 items; contexts up to >200K chars) covering six task types.

A head-to-head evaluation of 10 open-source LLMs plus ChatGPT and GPT‑4 across vanilla, prompt-compression (LongLLMLingua), and two RAG setups (OpenAI/Jina embeddings).

Key Findings

Embedding-based RAG improves average accuracy versus vanilla baseline.

Numbers: Vanilla avg 41.76% → OpenAI RAG avg 50.46% → Jina RAG avg 53.10%

Practical Use: If you must answer questions over long documents, add a retrieval step with document embeddings; it typically boosts end-to-end accuracy by ~9–11 percentage points vs. sending raw long context.

Evidence Ref: Table 4 (Avg. rows for each method)
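The retrieval step recommended above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model (such as the OpenAI or Jina embeddings used in the paper), and the chunk size, `k`, and all function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_chars: int = 200) -> List[str]:
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in embedding: word counts, NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, document: str, k: int = 2,
             chunk_chars: int = 200) -> List[str]:
    """Return the k chunks most similar to the question."""
    chunks = chunk_text(document, chunk_chars)
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)),
                    reverse=True)
    return ranked[:k]
```

In a real setup, only the retrieved chunks (rather than the full document) would be placed in the prompt alongside the question and answer options; swapping `toy_embed` for a dense embedding model is the part that would matter most for accuracy.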

Prompt compression (LongLLMLingua) gives little net gain and can hurt some models.

Numbers: Vanilla avg 41.76% vs LongLLMLingua avg 40.96% (≈ -0.8pp)

Practical Use: Compression methods may reduce context size but don’t reliably improve QA accuracy; test compression carefully per model before deployment.

Evidence Ref: Table 4 (Avg. rows for Vanilla and LongLLMLingua)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	41.76%	—	—	Marathon (all tasks)	Table 4 (Vanilla Avg.)	Table 4
Accuracy	50.46%	Vanilla 41.76%	+8.70 pp	Marathon (all tasks)	Table 4 (OpenAI Embedding RAG Avg.)	Table 4
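The deltas quoted in this summary are plain differences in percentage points against the vanilla average. As a quick check, the short sketch below recomputes them from the averages reported here (Table 4, Avg. rows); the dictionary and function names are illustrative assumptions, not part of the benchmark code.

```python
# Average accuracies (%) as quoted in this summary (Table 4, Avg. rows).
AVG_ACCURACY = {
    "Vanilla": 41.76,
    "LongLLMLingua": 40.96,
    "OpenAI Embedding RAG": 50.46,
    "Jina Embedding RAG": 53.10,
}


def delta_pp(method: str, baseline: str = "Vanilla") -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(AVG_ACCURACY[method] - AVG_ACCURACY[baseline], 2)


if __name__ == "__main__":
    for name in AVG_ACCURACY:
        if name != "Vanilla":
            print(f"{name}: {delta_pp(name):+.2f} pp vs Vanilla")
```

Run as a script, this gives +8.70 pp for OpenAI embedding RAG, +11.34 pp for Jina embedding RAG, and -0.80 pp for LongLLMLingua.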

What To Try In 7 Days

Run a Jina-embedding RAG pipeline on your long-doc QA use case and compare accuracy to your current approach.

Add a strict output-format validator (JSON schema) and a small post-processor to fix malformed model outputs.

Benchmark your most common long-context tasks (timeline ordering, numeric inference) and flag them for special handling.
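For the output-format validator suggested in the list above, one possible shape is sketched below. It assumes the model is asked to reply with a JSON object such as `{"answer": "B"}` and that each item has four lettered options; both are assumptions made for illustration, and `extract_answer` is a hypothetical helper rather than code from the Marathon repository.

```python
import json
import re
from typing import Optional

# Assumption: four lettered choices per question.
VALID_OPTIONS = {"A", "B", "C", "D"}


def extract_answer(raw_output: str) -> Optional[str]:
    """Return the chosen option letter, or None if no valid answer is found."""
    # Models often wrap the JSON in extra prose, so scan every {...} span
    # (non-nested) instead of parsing the whole output at once.
    for match in re.finditer(r"\{[^{}]*\}", raw_output):
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not valid JSON; try the next candidate span
        answer = payload.get("answer")
        if isinstance(answer, str) and answer.strip().upper() in VALID_OPTIONS:
            return answer.strip().upper()
    return None
```

Outputs for which this returns `None` can be counted separately from wrong answers, which makes it easier to tell format failures apart from comprehension failures.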

Optimization Features

Token Efficiency

context chunking to 12,000-token segments for distractor generation
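Fixed-size chunking of this kind can be sketched as below. The summary does not say which tokenizer was used, so the example takes an already-tokenized list as input; the function name and the whitespace tokenization in the usage note are assumptions.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 12_000) -> List[List[str]]:
    """Split a token list into consecutive segments of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```

For a rough experiment, `chunk_tokens(document.split())` approximates tokens with whitespace-separated words; a model-specific tokenizer would give different segment boundaries.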

Infra Optimization

evaluation runs used 4x A100 80GB; model GPU allocation varies by size

Inference Optimization

prompt compression (LongLLMLingua) evaluated

retrieval-augmented generation (embedding-based RAG) evaluated

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Hambaobao/Marathon

https://openbenchmark.online/marathon

Data URLs

https://github.com/Hambaobao/Marathon (questions and contexts; answers withheld)

Source datasets: LongBench, LooGLE (both MIT-licensed as noted)

Risks & Boundaries

Limitations

Context lengths are not uniformly distributed; examples >200K chars are rare.

Passage Retrieval items were sampled from LongBench and are shorter than other tasks.

When Not To Use

When you need exact training labels or full answer keys (answers are not released).

When you require an even distribution of context lengths, which the current dataset lacks (the authors plan to expand it).

Failure Modes

Models produce long, extra text and ignore strict output formats, breaking downstream parsers.

Prompt compression can drop needed details and sometimes reduce accuracy.

Core Entities

Models

GPT-4, ChatGPT, Yi-34B, Beluga-70B, Tulu2-70B, Qwen-14B, Mistral-7B, Zephyr-7B, ChatGLM3-6B, Alfred-40B, Longchat-13B, StripedHyena-7B

Metrics

Accuracy, JSON instruction-following rate

Datasets

LongBench, LooGLE

Benchmarks

Marathon
