Marathon: a multiple-choice benchmark that stresses LLMs with very long documents (up to ~260K chars)

December 15, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and public, and the experiments show consistent RAG gains; however, task coverage and embedding choices are limited and answers are withheld to protect evaluation integrity.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Lei Zhang, Yunshui Li, Ziqiang Liu, Jiaxi Yang, Junhao Liu, Longze Chen, Run Luo, Min Yang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you publish or productize long-document QA, use retrieval with document embeddings — it gives consistent accuracy gains over naively feeding very long text and helps make outputs traceable.

Who Should Care

Summary TLDR

The authors release Marathon, a multiple-choice benchmark focused on long-context understanding and reasoning. It combines samples from LongBench and LooGLE, spans contexts from ~2K to >200K characters, and contains 1,530 test items across six tasks (e.g., comprehension, timeline reorder, computation). They evaluate 10 open-source LLMs plus ChatGPT and GPT‑4, and compare three optimization methods: prompt compression (LongLLMLingua) and two RAG setups, one using OpenAI embeddings and one using Jina embeddings. Main results: embedding-based RAG raises average accuracy by roughly 9 to 11 percentage points versus the vanilla baseline (+8.70 pp with OpenAI embeddings, +11.34 pp with Jina); prompt compression gives little or mixed gains. Timeline reorder and numeric/‘

Problem Statement

Existing long-context benchmarks use free-form metrics (F1/BLEU/ROUGE) and short samples, which mis-score valid model outputs and fail to test comprehension across very long documents. We need a compact, reliable benchmark that (1) forces selection among vetted alternatives, (2) covers much longer contexts, and (3) enables fair comparison of long-context optimization methods.
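The scoring problem described above can be illustrated with a small sketch. It assumes a simple whitespace-token F1 (a common free-form metric, not necessarily the exact variant used by the cited benchmarks) and a made-up question whose reference answer is "in 1648". The function names `token_f1` and `choice_accuracy` are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a free-form prediction and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def choice_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Multiple-choice accuracy: fraction of items where the chosen letter matches the key."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# A correct but verbose free-form answer is penalized by F1 (made-up example).
verbose_score = token_f1("the treaty was signed in the year 1648", "in 1648")

# With lettered options, the same answer is simply right or wrong.
mc_score = choice_accuracy(["A", "C", "B"], ["A", "C", "D"])
```

Here the verbose answer is factually right yet scores well below 1.0 under F1, while the multiple-choice score depends only on which option was selected.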

Main Contribution

Marathon: a new multiple-choice benchmark for long-context tasks (1,530 items; contexts up to >200K chars) covering six task types.

A head-to-head evaluation of 10 open-source LLMs plus ChatGPT and GPT‑4 across vanilla, prompt-compression (LongLLMLingua), and two RAG setups (OpenAI/Jina embeddings).

Key Findings

Embedding-based RAG improves average accuracy versus vanilla baseline.

Numbers: Vanilla avg 41.76% → OpenAI RAG avg 50.46% → Jina RAG avg 53.10%

Practical Use: If you must answer questions over long documents, add a retrieval step with document embeddings; it typically boosts end-to-end accuracy by ~9–11 percentage points vs. sending raw long context.

Evidence Ref: Table 4 (Avg. rows for each method)
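A minimal sketch of the retrieval step, assuming a toy bag-of-words embedding as a stand-in for a real embedding model (the evaluated setups use OpenAI and Jina embeddings, which are not called here). The names `embed`, `chunk_text`, and `retrieve`, the 200-character chunk size, and `top_k` are illustrative assumptions, not the authors' pipeline.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_text(document: str, chunk_chars: int = 200) -> list[str]:
    """Split a long document into fixed-size character chunks (size is an assumption)."""
    return [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]


def retrieve(document: str, question: str, top_k: int = 2, chunk_chars: int = 200) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query_vec = embed(question)
    chunks = chunk_text(document, chunk_chars)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vec), reverse=True)
    return ranked[:top_k]


# Made-up document: one relevant sentence buried in filler text.
doc = "filler " * 100 + "The treaty was signed in 1648 in Westphalia. " + "padding " * 100
best_chunk = retrieve(doc, "When was the treaty signed?", top_k=1)[0]
```

Only the retrieved chunks, rather than the full document, would then be placed in the model prompt together with the question and the answer options.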

Prompt compression (LongLLMLingua) gives little net gain and can hurt some models.

Numbers: Vanilla avg 41.76% vs LongLLMLingua avg 40.96% (≈ -0.8pp)

Practical Use: Compression methods may reduce context size but don’t reliably improve QA accuracy; test compression carefully per model before deployment.

Evidence Ref: Table 4 (Avg. rows for Vanilla and LongLLMLingua)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 41.76% | - | - | Marathon (all tasks) | Table 4 (Vanilla Avg.) | Table 4
Accuracy | 50.46% | Vanilla 41.76% | +8.70 pp | Marathon (all tasks) | Table 4 (OpenAI Embedding RAG Avg.) | Table 4
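The percentage-point deltas can be checked with a few lines of arithmetic. This sketch uses only the average accuracies quoted in the Key Findings; the dictionary keys and the helper name `deltas_vs_baseline` are illustrative.

```python
# Average accuracies (%) quoted in the Key Findings (Table 4 "Avg." rows).
averages = {
    "Vanilla": 41.76,
    "LongLLMLingua": 40.96,
    "OpenAI RAG": 50.46,
    "Jina RAG": 53.10,
}


def deltas_vs_baseline(scores: dict, baseline: str = "Vanilla") -> dict:
    """Percentage-point difference of each method against the baseline."""
    base = scores[baseline]
    return {
        name: round(value - base, 2)
        for name, value in scores.items()
        if name != baseline
    }


deltas = deltas_vs_baseline(averages)
for method, delta in deltas.items():
    print(f"{method}: {delta:+.2f} pp")
# LongLLMLingua: -0.80 pp
# OpenAI RAG: +8.70 pp
# Jina RAG: +11.34 pp
```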

What To Try In 7 Days

Run a Jina-embedding RAG pipeline on your long-doc QA use case and compare accuracy to your current approach.

Add a strict output-format validator (JSON schema) and a small post-processor to fix malformed model outputs.

Benchmark your most common long-context tasks (timeline ordering, numeric inference) and flag them for special handling.
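The output-format validator and post-processor suggested above might look like the following sketch. It assumes the model is asked to reply with a JSON object of the form `{"answer": "A"}` and that there are four lettered options; both the schema and the option set are assumptions for illustration, not the benchmark's documented format. It uses only the standard library rather than a JSON Schema package.

```python
import json
import re
from typing import Optional

VALID_OPTIONS = {"A", "B", "C", "D"}  # assumption: four lettered options


def validate(obj: object) -> Optional[str]:
    """Accept only a dict with exactly one key, "answer", holding a valid option letter.

    The letter is normalized (whitespace stripped, upper-cased) before checking.
    """
    if isinstance(obj, dict) and set(obj) == {"answer"}:
        answer = obj["answer"]
        if isinstance(answer, str) and answer.strip().upper() in VALID_OPTIONS:
            return answer.strip().upper()
    return None


def parse_model_output(raw: str) -> Optional[str]:
    """Try the whole reply as JSON first, then any flat {...} span embedded in extra text."""
    candidates = [raw.strip()]
    candidates += re.findall(r"\{[^{}]*\}", raw)
    for candidate in candidates:
        try:
            result = validate(json.loads(candidate))
        except json.JSONDecodeError:
            continue
        if result is not None:
            return result
    return None  # caller decides whether to retry or count the item as wrong
```

For example, `parse_model_output('Sure! {"answer": "c"} Hope that helps.')` recovers `"C"`, while a reply with no valid JSON object yields `None`. Tracking how often the fallback path is needed gives a rough instruction-following rate per model.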

Optimization Features

Token Efficiency: context chunking to 12,000-token segments for distractor generation
Infra Optimization: evaluation runs used 4x A100 80GB; model GPU allocation varies by size
Inference Optimization: prompt compression (LongLLMLingua) evaluated; retrieval-augmented generation (embedding-based RAG) evaluated
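As a rough illustration of the 12,000-token chunking mentioned under Token Efficiency, the sketch below splits a token list into consecutive segments. Whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` is hypothetical; the source does not say which tokenizer or splitting rule the authors used.

```python
def chunk_tokens(tokens: list, max_tokens: int = 12_000) -> list:
    """Split a token list into consecutive segments of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Assumption: whitespace tokens as a stand-in for model tokenizer output.
long_context = "word " * 30_000
segments = chunk_tokens(long_context.split())
segment_lengths = [len(s) for s in segments]  # [12000, 12000, 6000]
```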

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/Hambaobao/Marathon (questions and contexts; answers withheld)
Source datasets: LongBench, LooGLE (both MIT-licensed as noted)

Risks & Boundaries

Limitations

Context lengths are not uniformly distributed; examples >200K chars are rare.

Passage Retrieval items were sampled from LongBench and are shorter than other tasks.

When Not To Use

When you need exact training labels or a full answer key (answers are not released).

When you require an even distribution of context lengths, which the current dataset does not provide (the authors plan to expand it).

Failure Modes

Models produce long, extra text and ignore strict output formats, breaking downstream parsers.

Prompt compression can drop needed details and sometimes reduce accuracy.

Core Entities

Models

GPT-4, ChatGPT, Yi-34B, Beluga-70B, Tulu2-70B, Qwen-14B, Mistral-7B, Zephyr-7B, ChatGLM3-6B, Alfred-40B, Longchat-13B, StripedHyena-7B

Metrics

Accuracy, JSON instruction-following rate

Datasets

LongBench, LooGLE

Benchmarks

Marathon