LongBench — 21 long-text tasks (Chinese+English) to measure LLMs' long-context understanding up to tens of thousands of tokens

August 28, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

LongBench is well-engineered for comparative evaluation and length stress-testing; results are driven by automatic metrics and a controlled model set, so apply human checks for final decisions.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 55%

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you process long documents (reports, legal files, code repos), LongBench measures real-world long-context ability and shows whether models truly use long inputs or just memorize shortcuts.

Who Should Care

Summary TLDR

LongBench is a bilingual benchmark for evaluating how well language models use very long contexts. It bundles 21 tasks across six categories (single-document QA, multi-document QA, summarization, few-shot in-context learning, synthetic tasks, and code completion) into a unified format with automatic scoring. The authors evaluate 8 popular models on 4,750 test instances (average length 6,711 words for English, 13,386 characters for Chinese). Main takeaways: GPT-3.5-Turbo-16k leads among the evaluated models but still declines on very long inputs; position scaling and fine-tuning on longer sequences help substantially; retrieval-based or summarization-based compression helps weaker models but does not replace native long-context training.

Problem Statement

Most LLMs handle only a few thousand tokens, yet real inputs (books, reports, repos) require thousands to tens of thousands of tokens. There is no comprehensive bilingual, multitask benchmark focused on long-context usage to compare models and methods that extend context length.

Main Contribution

LongBench: a bilingual (English/Chinese) benchmark with 21 datasets across 6 task categories for long-context evaluation.

LongBench-E: a length-balanced subset to analyze performance by context length (0–4k, 4k–8k, 8k+).
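
The three length ranges can be illustrated with a small bucketing helper. This is only a sketch: `length_bucket` and `bucket_counts` are hypothetical names, the whitespace word count is an assumed length measure, and sending a length of exactly 4k or 8k to the higher range is also an assumption rather than something the summary specifies.

```python
from collections import Counter

def length_bucket(length: int) -> str:
    """Map a context length to one of the three LongBench-E ranges.

    Boundary handling (4000 and 8000 go to the higher range) is an assumption.
    """
    if length < 4000:
        return "0-4k"
    if length < 8000:
        return "4k-8k"
    return "8k+"

def bucket_counts(contexts: list[str]) -> Counter:
    # Length is approximated by whitespace-separated word count (assumption).
    return Counter(length_bucket(len(c.split())) for c in contexts)

# Toy contexts of 10, 5000, and 9000 words.
toy = ["w " * 10, "w " * 5000, "w " * 9000]
print(bucket_counts(toy))
```

Reporting a score per bucket, rather than one pooled average, is what makes a length-related decline visible.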

Key Findings

LongBench covers 21 datasets, 6 task categories, and 4,750 test instances with long contexts.

Numbers: 21 datasets; 6 categories; 4,750 instances; avg len 6,711 words (EN), 13,386 chars (ZH).

Practical Use: Use LongBench to test real long-context scenarios across languages and task styles before deploying models on long-document workloads.

Evidence Ref: Abstract; Table 1

A commercial model (GPT-3.5-Turbo-16k) outperforms evaluated open-source models overall but still degrades on very long inputs.

Numbers: GPT-3.5-Turbo-16k overall ~44.7% (reported overall score); drops about 17% from the 0–4k range to the 8k+ range on LongBench-E.

Practical Use: Expect better baseline performance from high-quality commercial models, but plan for additional engineering (fine-tuning or context strategies) when inputs exceed a few thousand tokens.

Evidence Ref: Tables 2–3; LongBench-E Table 9; Figure 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall macro-average (All tasks) | GPT-3.5-Turbo-16k 44.7% | - | - | LongBench (all tasks) | Table 3 overall 'All' column for GPT-3.5-Turbo-16k | Table 3
Performance drop with length (0–4k → 8k+) | GPT-3.5: 51.5 → 42.4 (−17%) | - | −17% | LongBench-E macro-average | LongBench-E Table 9; Figure 3 | Table 9
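
The length-related delta is a relative change between the two LongBench-E scores reported here (51.5 for 0–4k and 42.4 for 8k+). The short check below reproduces that arithmetic; `relative_change` is a hypothetical helper name used only for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0

short_ctx = 51.5  # GPT-3.5-Turbo-16k, 0-4k range (value reported above)
long_ctx = 42.4   # GPT-3.5-Turbo-16k, 8k+ range (value reported above)

drop = relative_change(short_ctx, long_ctx)
print(f"{drop:.1f}%")  # -17.7%
```

The result is about −17.7%, which the summary reports as −17%. In absolute terms the gap is 9.1 points.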

What To Try In 7 Days

Run LongBench (or LongBench-E) on your model to profile long-context failure modes.

If you can retrain or fine-tune, try position-scaling (RoPE interpolation) or continued training on longer sequences and compare on LongBench-E.

For deployed models that can't be re-trained, add retrieval or chunked summarization and measure gains on the QA and summarization subsets.
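
As a rough illustration of the position-scaling idea, the sketch below shows linear position interpolation for RoPE, one common form of scaling. It is not taken from the paper or from any evaluated model's code: the function names, the 4k trained length, the 16k target length, and the tiny dimension are all assumptions chosen for readability.

```python
import math

def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angles for one position: position * base^(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

def interpolated_angles(position: int, dim: int,
                        trained_len: int, target_len: int) -> list[float]:
    """Linear position interpolation: shrink positions by trained_len / target_len
    so every position in the longer window maps into the trained range."""
    scale = trained_len / target_len
    return rope_angles(position * scale, dim)

# Illustrative sizes (assumptions): trained on 4k positions, extended to 16k.
trained_len, target_len, dim = 4096, 16384, 8
last = interpolated_angles(target_len - 1, dim, trained_len, target_len)

# The first angle equals the scaled position itself, which stays below trained_len.
print(last[0], math.isclose(last[0], (target_len - 1) * trained_len / target_len))
```

Because the scaled positions never exceed the trained range, the model sees rotation angles it has encountered before, at the cost of finer spacing between neighboring positions. That trade-off is typically why some fine-tuning on longer sequences is paired with the scaling.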

Optimization Features

Token Efficiency: chunking and top-N retrieval to reduce input length
Model Optimization: position-embedding scaling (RoPE interpolation)
Training Optimization: continued training on longer context sequences
Inference Optimization: retrieval-based context compression; summary-based context compression
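
A minimal sketch of chunking plus top-N retrieval follows. The word-overlap scorer is a toy stand-in for a real retriever, and all function names, the chunk size, and the choice to keep the selected chunks in document order are assumptions made for illustration.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words found in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def compress_context(query: str, context: str,
                     chunk_size: int, top_n: int) -> str:
    chunks = chunk_words(context, chunk_size)
    # Rank chunk indices by score; the stable sort keeps earlier chunks first on ties.
    ranked = sorted(range(len(chunks)),
                    key=lambda i: -overlap_score(query, chunks[i]))
    keep = sorted(ranked[:top_n])  # restore document order (assumption)
    return "\n".join(chunks[i] for i in keep)

context = ("the report covers revenue growth in asia "
           "weather was mild during the whole quarter "
           "revenue in europe fell on currency effects")
query = "how did revenue change in europe"
compressed = compress_context(query, context, chunk_size=7, top_n=2)
print(compressed)
```

Here the middle chunk is dropped because it shares no words with the query. The same mechanism can discard a chunk that holds the actual evidence, so any gain from compression should be measured on the QA and summarization subsets rather than assumed.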

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Automated metrics (ROUGE-L, F1, Edit Sim) can misjudge quality, especially for long or verbose outputs.

Performance mixes long-context ability with instruction-following; separating them is nontrivial.
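
To make the first limitation concrete, the sketch below computes a simplified token-level F1 and shows how a correct but verbose answer is penalized. The normalization (lowercasing and whitespace splitting) is an assumption and is simpler than what benchmark scoring scripts typically apply; `token_f1` is a hypothetical helper.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (simplified normalization)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "paris"
concise = "paris"
verbose = "the capital of france is paris"

print(token_f1(concise, reference))            # 1.0
print(round(token_f1(verbose, reference), 3))  # 0.286
```

Both answers are correct, yet the verbose one scores 2/7 because precision falls as extra tokens are added. This is one reason chat-tuned models that answer in full sentences can look worse than they are under automatic scoring.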

When Not To Use

When you need human judgment for nuanced summaries or subjective quality.

When your application is short-context only and does not require >4k tokens.

Failure Modes

Models may rely on memorization rather than using the provided context.

Retrieval/compression can omit critical evidence, causing incorrect answers.

Core Entities

Models

GPT-3.5-Turbo-16k, Llama2-7B-chat-4k, LongChat-v1.5-7B-32k, XGen-7B-8k, InternLM-7B-8k, ChatGLM2-6B, ChatGLM2-6B-32k, Vicuna-v1.5-7B-16k

Metrics

F1, ROUGE-L, Edit Sim, Accuracy, Exact Match (EM)

Datasets

LongBench, LongBench-E, NarrativeQA, Qasper, MultiFieldQA-en, MultiFieldQA-zh, HotpotQA, 2WikiMultihopQA, MuSiQue, DuReader, GovReport, QMSum, MultiNews, VCSUM, TREC, TriviaQA, SAMSum, LSHT, PassageCount, PassageRetrieval-en, PassageRetrieval-zh, LCC, RepoBench-P

Benchmarks

LongBench, LongBench-E

Context Entities

Models

ChatGLM2-6B (base), Llama2 family

Benchmarks

ZeroSCROLLS/SCROLLS, L-Eval