Marathon: a multiple-choice benchmark that stresses LLMs with very long documents (up to ~260K chars)

December 15, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Lei Zhang, Yunshui Li, Ziqiang Liu, Jiaxi Yang, Junhao Liu, Longze Chen, Run Luo, Min Yang


Why It Matters For Business

If you publish or productize long-document QA, use retrieval over document embeddings: in this benchmark it raised average accuracy by roughly 9 to 11 percentage points over feeding the full text to the model, and the retrieved passages make outputs easier to trace.

Summary TLDR

The authors release Marathon, a multiple-choice benchmark focused on long-context understanding and reasoning. It combines samples from LongBench and LooGLE, spans contexts from ~2K to >200K characters, and contains 1,530 test items across six tasks (e.g., comprehension, timeline reorder, computation). They evaluate 10 open-source LLMs plus ChatGPT and GPT‑4, and compare three optimization methods: prompt compression (LongLLMLingua) and two RAG setups using OpenAI and Jina embeddings. Main results: embedding-based RAG raises average accuracy over the vanilla baseline by about 8.7 percentage points with OpenAI embeddings and about 11.3 with Jina embeddings; prompt compression gives little or mixed gains. Timeline reorder and numeric/…

Problem Statement

Existing long-context benchmarks use free-form metrics (F1/BLEU/ROUGE) and short samples, which mis-score valid model outputs and fail to test comprehension across very long documents. We need a compact, reliable benchmark that (1) forces selection among vetted alternatives, (2) covers much longer contexts, and (3) enables fair comparison of long-context optimization methods.

Main Contribution

Marathon: a new multiple-choice benchmark for long-context tasks (1,530 items; contexts up to >200K chars) covering six task types.

A head-to-head evaluation of 10 open-source LLMs plus ChatGPT and GPT‑4 across vanilla, prompt-compression (LongLLMLingua), and two RAG setups (OpenAI/Jina embeddings).

Empirical finding: embedding-based RAG improves QA accuracy more than prompt compression; timeline reorder and computation remain hard.

Key Findings

Embedding-based RAG improves average accuracy versus vanilla baseline.

Numbers: Vanilla avg 41.76% → OpenAI RAG avg 50.46% → Jina RAG avg 53.10%

Prompt compression (LongLLMLingua) gives little net gain and can hurt some models.

Numbers: Vanilla avg 41.76% vs LongLLMLingua avg 40.96% (≈ -0.8pp)

Timeline reorder and computation tasks are the weakest across models.

Numbers: Vanilla task averages of 30.30% (Timeline Reorder) and 27.11% (Computation)

Top closed models outperform open-source models by a large margin.

Numbers: GPT-4 avg 78.59% vs best open-source under Jina RAG (Yi) 63.81%

Many open-source models struggle to follow strict output-format instructions.

Numbers: Vanilla JSON compliance varies widely (e.g., Longchat 92.29% vs Beluga 22.48%)
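
The percentage-point differences behind these findings can be re-derived from the reported averages. The snippet below is a small worked check, not part of the paper; the helper name `gain_over_baseline` is made up for illustration.

```python
# Average accuracies (%) copied from the findings above.
averages = {
    "Vanilla": 41.76,
    "LongLLMLingua": 40.96,
    "OpenAI RAG": 50.46,
    "Jina RAG": 53.10,
}


def gain_over_baseline(scores, baseline="Vanilla"):
    """Return each method's change versus the baseline, in percentage points."""
    base = scores[baseline]
    return {
        name: round(value - base, 2)
        for name, value in scores.items()
        if name != baseline
    }


gains = gain_over_baseline(averages)
for name, delta in gains.items():
    print(f"{name}: {delta:+.2f} pp")
```

Under these numbers, prompt compression sits slightly below the baseline, while both RAG setups sit well above it.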

Results

Accuracy

  • Vanilla avg: 41.76%
  • OpenAI RAG avg: 50.46% (baseline: Vanilla 41.76%)
  • Jina RAG avg: 53.10% (baseline: Vanilla 41.76%)
  • LongLLMLingua avg: 40.96% (baseline: Vanilla 41.76%)
  • GPT-4 avg: 78.59%

Task averages (Vanilla)

  • Timeline Reorder: 30.30%
  • Computation: 27.11%

Instruction-following (JSON), example values (Vanilla)

  • Longchat: 92.29%
  • Beluga: 22.48%
  • Yi: 38.95%

Who Should Care

What To Try In 7 Days

Run a Jina-embedding RAG pipeline on your long-doc QA use case and compare accuracy to your current approach.

Add a strict output-format validator (JSON schema) and a small post-processor to fix malformed model outputs.

Benchmark your most common long-context tasks (timeline ordering, numeric inference) and flag them for special handling.
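
The output-format validator suggested above could look roughly like the following. This is a minimal sketch under stated assumptions: the `{"answer": "<letter>"}` schema, the option labels A to D, and the function name `parse_answer` are all hypothetical, not taken from the benchmark.

```python
import json
import re

VALID_CHOICES = {"A", "B", "C", "D"}  # assumed option labels


def parse_answer(raw_output):
    """Return the chosen option letter, or None if it cannot be recovered."""
    # Candidate 1: the whole output is already a JSON object.
    candidates = [raw_output.strip()]
    # Candidate 2 (post-processing): the first flat {...} span, for outputs
    # that wrap the JSON in extra chatter. Nested objects are not handled.
    match = re.search(r"\{.*?\}", raw_output, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        answer = str(data.get("answer", "")).strip().upper()
        if answer in VALID_CHOICES:
            return answer
    return None
```

Returning `None` instead of guessing keeps malformed outputs visible, so they can be counted or retried rather than silently scored.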

Optimization Features

Token Efficiency

  • context chunking to 12,000-token segments for distractor generation
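
Fixed-size chunking of this kind can be sketched as below. The whitespace split is only a stand-in for a real tokenizer, and `chunk_tokens` is an illustrative name; the source does not say which tokenizer or splitting rule the authors used.

```python
def chunk_tokens(tokens, max_tokens=12_000):
    """Split a token list into consecutive segments of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Whitespace splitting stands in for a model tokenizer here.
document = "word " * 30_000
segments = chunk_tokens(document.split())
print([len(segment) for segment in segments])
```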

Infra Optimization

  • evaluation runs used 4× A100 80GB GPUs; the GPU allocation per model varies with model size

Inference Optimization

  • prompt compression (LongLLMLingua) evaluated
  • retrieval-augmented generation (embedding-based RAG) evaluated
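
The retrieval step in embedding-based RAG can be illustrated with a toy example. This is a sketch of the general technique, not the authors' pipeline: the word-count `embed` function replaces a real embedding model (such as the OpenAI or Jina embeddings evaluated here), and all function names and the prompt wording are assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, options, chunks, top_k=2):
    """Assemble a multiple-choice prompt from retrieved chunks only."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    choices = "\n".join(f"{label}. {text}" for label, text in options.items())
    return f"Context:\n{context}\n\nQuestion: {question}\n{choices}\nAnswer with one letter."
```

Only the retrieved chunks reach the model, which is why the prompt stays short however long the source document is.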

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Context lengths are not uniformly distributed; examples >200K chars are rare.
  • Passage Retrieval items were sampled from LongBench and have shorter contexts than items in the other tasks.
  • RAG evaluation only used OpenAI and Jina embeddings; other embedding systems were not tested due to cost/time.
  • The benchmark publishes contexts and questions but not correct answers; online submission is required for scoring.

When Not To Use

  • When you need exact training labels or full answer keys (answers are not released).
  • If you require an even distribution of context lengths, which the current dataset does not provide (the authors plan to expand it).
  • For evaluating multimodal long-context capabilities (no multimodal items included).

Failure Modes

  • Models emit long, extraneous text and ignore strict output-format instructions, which breaks downstream parsers.
  • Prompt compression can drop needed details and sometimes reduce accuracy.
  • RAG improvements are uneven: some tasks (timeline, computation) remain poorly solved even after retrieval.

Core Entities

Models

  • GPT-4
  • ChatGPT
  • Yi-34B
  • Beluga-70B
  • Tulu2-70B
  • Qwen-14B
  • Mistral-7B
  • Zephyr-7B
  • ChatGLM3-6B
  • Alfred-40B
  • Longchat-13B
  • StripedHyena-7B

Metrics

  • Accuracy
  • JSON instruction-following rate

Datasets

  • LongBench
  • LooGLE

Benchmarks

  • Marathon