LongBench — 21 long-text tasks (Chinese+English) to measure LLMs' long-context understanding up to tens of thousands of tokens

Overview

Decision Snapshot: Ready For Pilot

LongBench is well-engineered for comparative evaluation and length stress-testing; results are driven by automatic metrics and a controlled model set, so apply human checks for final decisions.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 55%

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you process long documents (reports, legal files, code repos), LongBench measures long-context ability on realistic tasks and shows whether models actually use long inputs or fall back on memorized knowledge and shortcuts.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

LongBench is a bilingual benchmark for evaluating how well language models use very long contexts. It bundles 21 tasks across six categories (single-document QA, multi-document QA, summarization, few-shot in-context learning, synthetic tests, and code completion) into a unified format with automatic scoring. The authors evaluate 8 popular models on 4,750 test instances (average length 6,711 words for English, 13,386 characters for Chinese). Main takeaways: GPT-3.5-Turbo-16k leads among the evaluated models but declines on very long inputs; position scaling and fine-tuning for long contexts help substantially; retrieval or summarization-based compression helps weaker models but does not replace native long-context training.

Problem Statement

Most LLMs handle only a few thousand tokens, yet real inputs (books, reports, repos) run from thousands to tens of thousands of tokens. Before LongBench, there was no comprehensive bilingual, multitask benchmark focused on long-context usage for comparing models and context-extension methods.

Main Contribution

LongBench: a bilingual (English/Chinese) benchmark with 21 datasets across 6 task categories for long-context evaluation.

LongBench-E: a length-balanced subset to analyze performance by context length (0–4k, 4k–8k, 8k+).
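The length ranges above suggest a simple per-bucket breakdown of scores. The following is a minimal sketch of that idea, not the benchmark's own tooling: the function names are hypothetical, the unit of `length` (tokens, words, or characters) is left open, and placing exact boundary values (4,000 and 8,000) in the upper bucket is an assumption.

```python
from collections import defaultdict

# Bucket edges follow the ranges named above. The unit of `length` and the
# treatment of exact boundary values are assumptions made for this sketch.
def length_bucket(length: int) -> str:
    if length < 4000:
        return "0-4k"
    if length < 8000:
        return "4k-8k"
    return "8k+"

def mean_score_by_bucket(records):
    """Average score per length bucket.

    records: iterable of (context_length, score) pairs.
    """
    grouped = defaultdict(list)
    for length, score in records:
        grouped[length_bucket(length)].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in grouped.items()}

# Made-up records for illustration only, not benchmark results.
example = [(1200, 0.5), (3900, 0.7), (5000, 0.4), (12000, 0.2)]
by_bucket = mean_score_by_bucket(example)
```

Comparing the per-bucket means for one model shows how quickly its quality falls off as inputs grow, which is the kind of analysis a length-balanced subset is meant to support.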

Key Findings

LongBench covers 21 datasets, 6 task categories, and 4,750 test instances with long contexts.

Numbers: 21 datasets; 6 categories; 4,750 instances; average length 6,711 words (EN), 13,386 characters (ZH).

Practical Use: Use LongBench to test real long-context scenarios across languages and task styles before deploying models on long-document workloads.

Evidence Ref: Abstract; Table 1

A commercial model (GPT-3.5-Turbo-16k) outperforms evaluated open-source models overall but still degrades on very long inputs.

Numbers: GPT-3.5-Turbo-16k scores ~44.7% overall (reported overall score) and drops 17% in relative terms from 0–4k to 8k+ on LongBench-E.

Practical Use: Expect better baseline performance from high-quality commercial models, but plan for additional engineering (fine-tuning or context strategies) when inputs exceed a few thousand tokens.

Evidence Ref: Tables 2–3; LongBench-E Table 9; Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall macro-average (All tasks)	GPT-3.5-Turbo-16k 44.7%	—	—	LongBench (all tasks)	Table 3 overall 'All' column for GPT-3.5-Turbo-16k	Table 3
Performance drop with length (0–4k → 8k+)	GPT-3.5: 51.5 → 42.4 (−17%)	—	-17%	LongBench-E macro-average	LongBench-E Table 9; Figure 3	Table 9
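The length-drop figure in the table is a relative change between the two bucket scores. As a small worked check (the helper name is hypothetical), the two reported values give roughly −17.7%, which the summary states as −17%:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Scores taken from the LongBench-E row above (0-4k bucket vs 8k+ bucket).
drop = relative_change_pct(51.5, 42.4)
print(f"{drop:.1f}%")  # prints -17.7%
```

Reporting the drop in relative rather than absolute points makes models with different baseline scores comparable on how well they hold up at longer lengths.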

What To Try In 7 Days

Run LongBench (or LongBench-E) on your model to profile long-context failure modes.

If you can retrain or fine-tune, try position-scaling (RoPE interpolation) or continued training on longer sequences and compare on LongBench-E.

For deployed models that can't be re-trained, add retrieval or chunked summarization and measure gains on the QA and summarization subsets.

Optimization Features

Token Efficiency

chunking and top-N retrieval to reduce input length
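Chunking plus top-N retrieval can be sketched as follows. This is an illustrative sketch only: the word-overlap scorer is a stand-in chosen to keep the example dependency-free, not the retriever used in the evaluation, and all function names and default values are hypothetical.

```python
from typing import List

def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def overlap_score(query: str, chunk: str) -> int:
    """Count chunk words that also appear in the query (toy relevance score)."""
    query_words = set(query.lower().split())
    return sum(1 for w in chunk.lower().split() if w in query_words)

def top_n_chunks(query: str, text: str, chunk_size: int = 200, n: int = 3) -> List[str]:
    """Keep the n highest-scoring chunks, returned in document order."""
    chunks = chunk_words(text, chunk_size)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: overlap_score(query, chunks[i]),
                    reverse=True)[:n]
    # Restore the original order so the shortened context still reads naturally.
    return [chunks[i] for i in sorted(ranked)]

# Tiny illustration with made-up text.
doc = "alpha beta gamma delta epsilon zeta"
kept = top_n_chunks("delta zeta", doc, chunk_size=2, n=2)
```

The shortened context is the concatenation of `kept`; in practice a stronger scorer (lexical or embedding-based) would replace `overlap_score`, and the chunk size and N trade input length against the risk of dropping needed evidence.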

Model Optimization

position-embedding scaling (RoPE interpolation)
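The general idea behind position interpolation for rotary embeddings is to compress position indices so that a longer input maps back into the range seen during training. The sketch below shows only that arithmetic; it is not the exact recipe of any model listed here, and the lengths, dimension, and function name are assumptions for illustration.

```python
def rope_angles(position: float, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> list:
    """Rotary angles for one position.

    Uses the standard RoPE frequencies base ** (-2i / dim). A `scale` below 1
    compresses positions, which is the interpolation step.
    """
    scaled = position * scale
    return [scaled * base ** (-2 * i / dim) for i in range(dim // 2)]

# Hypothetical lengths: a model trained on 4k positions, stretched to 8k.
train_len, target_len = 4096, 8192
scale = train_len / target_len  # 0.5

# Position 8000 lies outside the trained range; after scaling it produces the
# same angles as position 4000 did during training.
stretched = rope_angles(8000, dim=8, scale=scale)
reference = rope_angles(4000, dim=8)
```

Because interpolated positions fall between the integer positions the model was trained on, this kind of scaling is usually paired with some continued training on longer sequences.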

Training Optimization

continued training on longer context sequences

Inference Optimization

retrieval-based context compression

summary-based context compression

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/THUDM/LongBench

Data URLs

https://github.com/THUDM/LongBench

Risks & Boundaries

Limitations

Automated metrics (ROUGE-L, F1, EditSim) can misjudge quality, especially for long or verbose outputs.

Performance mixes long-context ability with instruction-following; separating them is nontrivial.

When Not To Use

When you need human judgment for nuanced summaries or subjective quality.

When your application is short-context only and does not require >4k tokens.

Failure Modes

Models may rely on memorization rather than using the provided context.

Retrieval/compression can omit critical evidence, causing incorrect answers.
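One way to probe the memorization failure mode is to score the same questions with and without the context and compare. The following is a hedged sketch of that check; `answer_fn`, `score_fn`, and the toy stand-ins are hypothetical placeholders for a real model call and a real metric.

```python
from typing import Callable, Sequence, Tuple

def context_gain(
    answer_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
    samples: Sequence[Tuple[str, str, str]],
) -> float:
    """Mean score with context minus mean score with the context removed.

    samples: (context, question, reference) triples.
    """
    with_ctx = [score_fn(answer_fn(c, q), ref) for c, q, ref in samples]
    without = [score_fn(answer_fn("", q), ref) for _, q, ref in samples]
    return sum(with_ctx) / len(samples) - sum(without) / len(samples)

# Toy stand-ins so the sketch runs end to end; replace with a real model
# call and a real metric.
def toy_model(context: str, question: str) -> str:
    return "Paris" if "Paris" in context else "unknown"

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction == reference)

gain = context_gain(
    toy_model,
    exact_match,
    [("The capital is Paris.", "What is the capital?", "Paris")],
)
```

A gain close to zero on a dataset suggests the model answers from memory rather than from the supplied text, so scores on that dataset say little about long-context ability.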

Core Entities

Models

GPT-3.5-Turbo-16k, Llama2-7B-chat-4k, LongChat-v1.5-7B-32k, XGen-7B-8k, InternLM-7B-8k, ChatGLM2-6B, ChatGLM2-6B-32k, Vicuna-v1.5-7B-16k

Metrics

F1, ROUGE-L, Edit Sim, Accuracy, Exact Match (EM)
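To make the F1 metric concrete, here is a simplified token-level F1 of the kind commonly used for QA scoring. It is a sketch under stated assumptions, not the benchmark's scoring code: it only lowercases and splits on whitespace, whereas real scorers typically also normalize punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

concise = token_f1("paris", "paris")
verbose = token_f1("the answer is paris", "paris")
```

Here `concise` is 1.0 while `verbose` is 0.4 even though both answers are correct, which illustrates why automatic metrics can misjudge long or verbose outputs.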

Datasets

LongBench, LongBench-E, NarrativeQA, Qasper, MultiFieldQA-en, MultiFieldQA-zh, HotpotQA, 2WikiMultihopQA, MuSiQue, DuReader, GovReport, QMSum, MultiNews, VCSUM, TREC, TriviaQA, SAMSum, LSHT, PassageCount, PassageRetrieval-en, PassageRetrieval-zh, LCC, RepoBench-P

Benchmarks

LongBench, LongBench-E

Context Entities

Models

ChatGLM2-6B (base), Llama2 family

Benchmarks

ZeroSCROLLS/SCROLLS, L-Eval
