Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.5
Citation Count
8
Why It Matters For Business
If you process long documents (reports, legal files, code repos), LongBench measures how well a model handles realistic long inputs and shows whether it actually uses the provided context or falls back on memorized answers.
Summary TLDR
LongBench is a bilingual (English/Chinese) benchmark for evaluating how well language models use very long contexts. It bundles 21 tasks across six categories (single-document QA, multi-document QA, summarization, few-shot in-context learning, synthetic tests, and code completion) into a unified format with automatic scoring. The authors evaluate 8 popular models on 4,750 test instances (average length 6,711 words for English and 13,386 characters for Chinese). Main takeaways: GPT-3.5-Turbo-16k leads among the evaluated models but declines on very long inputs; position scaling and fine-tuning on long contexts help substantially; retrieval- or summarization-based compression helps weaker models but does not replace native long-context training.
Problem Statement
Most LLMs handle only a few thousand tokens, yet real inputs (books, reports, repos) often run to tens of thousands of tokens. Before LongBench, there was no comprehensive bilingual, multitask benchmark focused on long-context usage for comparing models and the methods that extend context length.
Main Contribution
LongBench: a bilingual (English/Chinese) benchmark with 21 datasets across 6 task categories for long-context evaluation.
LongBench-E: a length-balanced subset to analyze performance by context length (0–4k, 4k–8k, 8k+).
Standardized format and automated evaluation (ROUGE-L, F1, EditSim) for reproducible, low-cost benchmarking.
Comprehensive evaluation of 8 LLMs and controlled analyses of truncation, retrieval and summarization compression, and memorization vs. context use.
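The automatic metrics named above can be approximated in a few lines. The following is a minimal sketch of token-level F1 and ROUGE-L (as an F-measure over the longest common subsequence), assuming lowercase whitespace tokenization. The function names are illustrative, the benchmark's own scorer may normalize text differently, and whitespace splitting does not segment Chinese. EditSim is left out here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 from bag-of-words overlap (whitespace tokens, an assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure: precision and recall are LCS length over each sequence length."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Both scores penalize extra tokens through the precision term, which is why a verbose but correct answer can score below a terse one.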
Key Findings
LongBench covers 21 datasets, 6 task categories, and 4,750 test instances with long contexts.
A commercial model (GPT-3.5-Turbo-16k) outperforms the evaluated open-source models overall but still degrades on very long inputs.
Scaling position embeddings and fine-tuning on longer contexts gives large gains for some models.
Retrieval- and summarization-based compression helps weaker models but is not a full substitute for native long-context modeling.
Models rely partly on memorization: withholding the context isolates what a model can answer from pretraining alone, and performance can drop sharply without it.
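The memorization check in the last finding amounts to scoring each sample twice, once with and once without its context. Below is a minimal sketch of that comparison. `context_gain`, `toy_model`, `exact_match`, and the sample fields are hypothetical names used for illustration, not part of the benchmark's code.

```python
from typing import Callable, Dict, List

MEMORIZED = {"capital of France?": "Paris"}  # stands in for pretraining knowledge


def toy_model(question: str, context: str) -> str:
    """Hypothetical model: answers from memory if it can, else reads the context."""
    if question in MEMORIZED:
        return MEMORIZED[question]
    # "Reading" is faked as returning the last word of the context.
    return context.split()[-1] if context else ""


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def context_gain(
    samples: List[Dict[str, str]],
    answer_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Mean score with context, with context withheld, and the difference."""
    if not samples:
        raise ValueError("samples must not be empty")
    with_ctx = [score_fn(answer_fn(s["question"], s["context"]), s["answer"]) for s in samples]
    without_ctx = [score_fn(answer_fn(s["question"], ""), s["answer"]) for s in samples]
    mean_with = sum(with_ctx) / len(samples)
    mean_without = sum(without_ctx) / len(samples)
    return {
        "with_context": mean_with,
        "without_context": mean_without,
        "gain": mean_with - mean_without,
    }


samples = [
    {"question": "capital of France?", "context": "The capital is Paris", "answer": "Paris"},
    {"question": "project codename?", "context": "The codename is Falcon", "answer": "Falcon"},
]
report = context_gain(samples, toy_model, exact_match)
```

A high `without_context` score points to memorization, while a large `gain` indicates the model is actually using the input.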
Results
Overall macro-average (All tasks)
Performance drop with length (0–4k → 8k+)
Relative improvement from long-context tuning
Retrieval compression best-case impact
Who Should Care
What To Try In 7 Days
Run LongBench (or LongBench-E) on your model to profile long-context failure modes.
If you can retrain or fine-tune, try position-scaling (RoPE interpolation) or continued training on longer sequences and compare on LongBench-E.
For deployed models that cannot be retrained, add retrieval or chunked summarization and measure gains on the QA and summarization subsets.
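For the profiling step, per-sample scores can be grouped into the LongBench-E length ranges (0–4k, 4k–8k, 8k+). The sketch below assumes each record already carries a `length` and a `score`. The field names, the unit of length, and the treatment of the exact boundaries (4,000 falls in the middle bucket here) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (label, inclusive lower bound, exclusive upper bound or None for open-ended)
BUCKETS: List[Tuple[str, int, Optional[int]]] = [
    ("0-4k", 0, 4000),
    ("4k-8k", 4000, 8000),
    ("8k+", 8000, None),
]


def bucket_for(length: int) -> str:
    """Return the label of the length bucket that contains `length`."""
    for label, low, high in BUCKETS:
        if length >= low and (high is None or length < high):
            return label
    raise ValueError(f"length must be non-negative, got {length}")


def mean_score_by_bucket(records: List[Dict[str, float]]) -> Dict[str, float]:
    """Average score per length bucket; buckets with no records are omitted."""
    grouped = defaultdict(list)
    for record in records:
        grouped[bucket_for(int(record["length"]))].append(record["score"])
    result = {}
    for label, _, _ in BUCKETS:  # keep the buckets in length order
        scores = grouped.get(label)
        if scores:
            result[label] = sum(scores) / len(scores)
    return result


records = [
    {"length": 1000, "score": 1.0},
    {"length": 3000, "score": 0.5},
    {"length": 5000, "score": 0.5},
    {"length": 12000, "score": 0.25},
]
profile = mean_score_by_bucket(records)
```

Comparing the first and last buckets gives a rough view of how quickly a model degrades as inputs grow.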
Optimization Features
Token Efficiency
- chunking and top-N retrieval to reduce input length
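As an illustration of chunking with top-N retrieval, the sketch below splits a document into fixed-size word chunks, ranks them by lexical overlap with the query, and keeps the best N. The overlap score is a deliberately simple stand-in for a real retriever (for example BM25 or an embedding model), and joining the kept chunks in document order is a design choice made here, not something the source specifies. All function names are illustrative.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Count chunk words that also occur in the query (toy relevance score)."""
    query_terms = set(query.lower().split())
    return sum(1 for word in chunk.lower().split() if word in query_terms)


def compress_context(text: str, query: str, chunk_size: int = 200, top_n: int = 3) -> str:
    """Keep the `top_n` highest-scoring chunks, joined in their original order."""
    chunks = chunk_words(text, chunk_size)
    # Stable sort: among equal scores, earlier chunks win.
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    kept = sorted(ranked[:top_n])
    return "\n".join(chunks[i] for i in kept)


document = (
    "alpha beta gamma delta budget report shows revenue "
    "growth epsilon zeta eta theta"
)
compressed = compress_context(document, "revenue growth report", chunk_size=4, top_n=2)
```

Smaller chunks with a larger N trade retrieval precision against the risk of cutting evidence in half, so both settings are worth sweeping.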
Model Optimization
- position-embedding scaling (RoPE interpolation)
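Linear position interpolation for RoPE can be sketched as follows: positions are multiplied by `trained_len / target_len` before the rotary angles are computed, so positions in the extended window map back into the range seen during training. This is a minimal pure-Python sketch, assuming the common base of 10000 and rotation of adjacent dimension pairs. Real implementations differ in pairing layout, and the source does not say which scaling variant each evaluated model uses.

```python
import math
from typing import List


def rope_angles(position: float, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> List[float]:
    """Rotation angle for each dimension pair at a (possibly rescaled) position."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    scaled_position = position * scale  # scale < 1 compresses positions
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec: List[float], position: float, base: float = 10000.0,
               scale: float = 1.0) -> List[float]:
    """Rotate adjacent pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by their angles."""
    angles = rope_angles(position, len(vec), base, scale)
    rotated = []
    for i, angle in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


trained_len = 4096   # context length used in pretraining (illustrative value)
target_len = 16384   # extended context length (illustrative value)
interp_scale = trained_len / target_len  # 0.25

# Position 16000 in the extended window gets the same angles as position 4000 did in training.
extended = rope_angles(16000, head_dim=8, scale=interp_scale)
original = rope_angles(4000, head_dim=8)
```

Because interpolated positions fall between the integer positions seen in training, this scaling is normally paired with some fine-tuning on longer sequences.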
Training Optimization
- continued training on longer context sequences
Inference Optimization
- retrieval-based context compression
- summary-based context compression
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Automated metrics (ROUGE-L, F1, EditSim) can misjudge quality, especially for long or verbose outputs.
- Performance mixes long-context ability with instruction-following; separating them is nontrivial.
- Some datasets originate from public corpora and may overlap with model pretraining, causing memorization confounds.
When Not To Use
- When you need human judgment for nuanced summaries or subjective quality.
- When your application is short-context only and does not require >4k tokens.
- When instruction-following is the sole target and long context is irrelevant.
Failure Modes
- Models may rely on memorization rather than using the provided context.
- Retrieval/compression can omit critical evidence, causing incorrect answers.
- Automatic metrics may underrate models that produce longer but correct answers.
Core Entities
Models
- GPT-3.5-Turbo-16k
- Llama2-7B-chat-4k
- LongChat-v1.5-7B-32k
- XGen-7B-8k
- InternLM-7B-8k
- ChatGLM2-6B
- ChatGLM2-6B-32k
- Vicuna-v1.5-7B-16k
Metrics
- F1
- ROUGE-L
- EditSim
- Accuracy
- Exact Match (EM)
Datasets
- LongBench
- LongBench-E
- NarrativeQA
- Qasper
- MultiFieldQA-en
- MultiFieldQA-zh
- HotpotQA
- 2WikiMultihopQA
- MuSiQue
- DuReader
- GovReport
- QMSum
- MultiNews
- VCSUM
- TREC
- TriviaQA
- SAMSum
- LSHT
- PassageCount
- PassageRetrieval-en
- PassageRetrieval-zh
- LCC
- RepoBench-P
Benchmarks
- LongBench
- LongBench-E
Context Entities
Models
- ChatGLM2-6B (base)
- Llama2 family
Benchmarks
- ZeroSCROLLS / SCROLLS
- L-Eval

