LongBench — 21 long-text tasks (Chinese+English) to measure LLMs' long-context understanding up to tens of thousands of tokens

August 28, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.5

Citation Count

8

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Links

Abstract / PDF

Why It Matters For Business

If you process long documents (reports, legal files, code repos), LongBench measures real-world long-context ability and shows whether a model actually uses its long input or falls back on memorized answers.

Summary TLDR

LongBench is a bilingual benchmark for evaluating how well language models use very long contexts. It bundles 21 tasks across six categories (single-doc QA, multi-doc QA, summarization, few-shot in-context learning, synthetic tests, and code completion) into a unified format with automatic scoring. The authors evaluate 8 popular models on 4,750 test instances (average length: 6,711 words for English, 13,386 characters for Chinese). Main takeaways: GPT-3.5-Turbo-16k leads among the evaluated models but still declines on very long inputs; position scaling and fine-tuning on longer contexts help substantially; retrieval- or summarization-based compression helps weaker models but does not replace native long-context training.

Problem Statement

Most LLMs handle only a few thousand tokens, yet real inputs (books, reports, repos) often run from several thousand to tens of thousands of tokens. No comprehensive bilingual, multitask benchmark focused on long-context usage existed for comparing models and the methods that extend their context length.

Main Contribution

LongBench: a bilingual (English/Chinese) benchmark with 21 datasets across 6 task categories for long-context evaluation.

LongBench-E: a length-balanced subset to analyze performance by context length (0–4k, 4k–8k, 8k+).

Standardized format and automated evaluation (ROUGE-L, F1, EditSim) for reproducible, low-cost benchmarking.

Comprehensive evaluation of 8 LLMs and controlled analyses of truncation, retrieval and summarization compression, and memorization vs. context use.
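To make the automated metrics listed above more concrete, here is a minimal sketch of two of them: a token-level F1 and an edit-distance similarity. The function names `token_f1` and `edit_similarity` are illustrative, and the whitespace tokenization and lowercasing are simplifying assumptions; the benchmark's own scoring scripts may normalize text differently.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def edit_similarity(a: str, b: str) -> float:
    """1 - Levenshtein(a, b) / max(len(a), len(b)), in the range [0, 1]."""
    if not a and not b:
        return 1.0
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

Both functions return values in [0, 1]; multiply by 100 to compare with percentage-style scores.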

Key Findings

LongBench covers 21 datasets, 6 task categories, and 4,750 test instances with long contexts.

Numbers: 21 datasets; 6 categories; 4,750 instances; average length 6,711 words (EN), 13,386 characters (ZH).

A commercial model (GPT-3.5-Turbo-16k) outperforms evaluated open-source models overall but still degrades on very long inputs.

Numbers: GPT-3.5-Turbo-16k scores about 44.7% overall; on LongBench-E its score drops by a relative 17% from the 0–4k range to the 8k+ range.

Scaling position embeddings and fine-tuning on longer contexts gives large gains for some models.

Numbers: ChatGLM2-6B-32k and LongChat-v1.5-7B-32k show relative improvements of ~62% and ~19%, respectively (versus shorter/unal‑

Retrieval- and summarization-based compression helps weaker models but is not a full substitute for native long-context modeling.

Numbers: The best retriever changed scores by −2%, +21%, and −5% across three example models; summarization helped mainly on

Models often rely partly on memorization; performance can drop sharply when context is withheld.

Numbers: Context-withheld drops (∆) vary; for example, GPT-3.5-Turbo-16k gains +18.9 to +50.3 points when context is present on some Q

Results

Overall macro-average (All tasks)

Value: GPT-3.5-Turbo-16k 44.7%

Performance drop with length (0–4k → 8k+)

Value: GPT-3.5-Turbo-16k: 51.5 → 42.4 (−17%)

Relative improvement from long-context tuning

Value: ChatGLM2-6B-32k: +62% (relative); LongChat-v1.5-7B-32k: +19% (relative)

Baseline: shorter-context variants

Retrieval compression best-case impact

Value: Score changes of −2%, +21%, and −5% (example trio of models under the best retriever)

Baseline: no retrieval
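The length-drop figure in the results above is a relative change between two length ranges. The sketch below shows one way to group per-instance scores into the three LongBench-E ranges and compute that change. It assumes "4k" and "8k" mean 4,000 and 8,000 length units, and the helper names are illustrative rather than part of the benchmark's tooling.

```python
from collections import defaultdict


def length_bucket(length: int) -> str:
    """Map a context length to a LongBench-E style range.

    Assumption: boundaries at 4,000 and 8,000 length units.
    """
    if length < 4000:
        return "0-4k"
    if length < 8000:
        return "4k-8k"
    return "8k+"


def mean_score_by_bucket(records):
    """Average score per range; `records` holds (length, score) pairs."""
    grouped = defaultdict(list)
    for length, score in records:
        grouped[length_bucket(length)].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in grouped.items()}


def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# The two scores reported above for the 0-4k and 8k+ ranges.
drop = relative_change(51.5, 42.4)
print(f"{drop:.1f}%")
```

With the two reported scores, the change works out to roughly −17.7%, in line with the −17% quoted above.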

Who Should Care

What To Try In 7 Days

Run LongBench (or LongBench-E) on your model to profile long-context failure modes.

If you can retrain or fine-tune, try position-scaling (RoPE interpolation) or continued training on longer sequences and compare on LongBench-E.

For deployed models that can't be re-trained, add retrieval or chunked summarization and measure gains on the QA and summarization subsets.
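Running the benchmark on a model with a short window, as in the first step above, means over-length inputs have to be truncated somehow. One common strategy, assumed here for illustration, keeps the beginning and end of the input and drops the middle, on the reasoning that instructions and questions tend to sit at the edges. `truncate_middle` is a hypothetical helper operating on an already tokenized sequence.

```python
def truncate_middle(tokens, max_len: int):
    """Keep the first and last parts of `tokens` so the result fits `max_len`.

    Inputs that already fit are returned unchanged (as a new list).
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2      # tokens kept from the start
    tail = max_len - head    # tokens kept from the end
    kept_tail = list(tokens[-tail:]) if tail else []
    return list(tokens[:head]) + kept_tail
```

For example, `truncate_middle(list(range(10)), 4)` keeps the two first and two last items. Comparing scores with and without truncation on the same inputs is one way to see how much a model depends on the dropped middle.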

Optimization Features

Token Efficiency

  • chunking and top-N retrieval to reduce input length
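As a rough illustration of chunking with top-N retrieval, the sketch below splits a document into fixed-size word chunks and keeps the N chunks that share the most words with the query. The word-overlap scorer is a deliberately simple stand-in and not a retriever from the paper; in practice an embedding model or BM25 would take its place. Both function names are hypothetical.

```python
def chunk_words(text: str, chunk_size: int):
    """Split `text` into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def top_n_chunks(query: str, chunks, n: int):
    """Return the `n` chunks with the highest word overlap with `query`.

    Ties go to the earlier chunk; selected chunks are returned in their
    original document order so the compressed context still reads in sequence.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk.lower().split())), idx)
        for idx, chunk in enumerate(chunks)
    ]
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    keep = sorted(idx for _, idx in best)
    return [chunks[i] for i in keep]
```

The retained chunks are concatenated and passed to the model in place of the full document, which shortens the input at the risk of dropping evidence the scorer missed.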

Model Optimization

  • position-embedding scaling (RoPE interpolation)
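The idea behind position interpolation can be shown in a few lines: positions in the extended window are rescaled so that they fall back inside the range seen during training, and the usual rotary angles are computed from the rescaled position. This is a minimal sketch under that assumption, with illustrative function names; it computes angles only and is not the training recipe of any model listed here.

```python
def rope_angles(position: float, dim: int, base: float = 10000.0):
    """Rotary angles for one position: position * base^(-2i/dim), i = 0..dim/2-1."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def interpolated_angles(position: float, dim: int,
                        train_len: int, target_len: int,
                        base: float = 10000.0):
    """Rotary angles after linearly scaling positions by train_len / target_len.

    With train_len=4096 and target_len=8192, position 8192 is mapped to 4096,
    so the model never sees an angle larger than those it was trained on.
    """
    scale = train_len / target_len
    return rope_angles(position * scale, dim, base)
```

Because neighbouring positions end up closer together after scaling, this approach is usually paired with some fine-tuning on longer sequences, which matches the combination of position scaling and long-context tuning described above.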

Training Optimization

  • continued training on longer context sequences

Inference Optimization

  • retrieval-based context compression
  • summary-based context compression

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Automated metrics (ROUGE-L, F1, EditSim) can misjudge quality, especially for long or verbose outputs.
  • Performance mixes long-context ability with instruction-following; separating them is nontrivial.
  • Some datasets originate from public corpora and may overlap with model pretraining, causing memorization confounds.

When Not To Use

  • When you need human judgment for nuanced summaries or subjective quality.
  • When your application is short-context only and does not require >4k tokens.
  • When instruction-following is the sole target and long context is irrelevant.

Failure Modes

  • Models may rely on memorization rather than using the provided context.
  • Retrieval/compression can omit critical evidence, causing incorrect answers.
  • Automatic metrics may under-score models that produce longer but correct answers.

Core Entities

Models

  • GPT-3.5-Turbo-16k
  • Llama2-7B-chat-4k
  • LongChat-v1.5-7B-32k
  • XGen-7B-8k
  • InternLM-7B-8k
  • ChatGLM2-6B
  • ChatGLM2-6B-32k
  • Vicuna-v1.5-7B-16k

Metrics

  • F1
  • ROUGE-L
  • Edit Sim
  • Accuracy
  • Exact Match (EM)

Datasets

  • LongBench
  • LongBench-E
  • NarrativeQA
  • Qasper
  • MultiFieldQA-en
  • MultiFieldQA-zh
  • HotpotQA
  • 2WikiMultihopQA
  • MuSiQue
  • DuReader
  • GovReport
  • QMSum
  • MultiNews
  • VCSUM
  • TREC
  • TriviaQA
  • SAMSum
  • LSHT
  • PassageCount
  • PassageRetrieval-en
  • PassageRetrieval-zh
  • LCC
  • RepoBench-P

Benchmarks

  • LongBench
  • LongBench-E

Context Entities

Models

  • ChatGLM2-6B (base)
  • Llama2 family

Benchmarks

  • ZeroSCROLLS/SCROLLS
  • L-Eval