Long-context LLMs fail to learn reliably from very long in‑context demonstrations

Overview

Decision Snapshot: Needs Validation

The paper offers a focused, reproducible benchmark and clear failure cases; results are strong for the claimed task but the benchmark covers one evaluation type only.

Citations: 17

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: MIT (dataset); code repo linked

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 50%

Authors

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

The authors build LongICLBench, a six-task benchmark that stresses long in‑context learning (ICL) on extreme-label classification (28–174 classes; prompts of 2K–50K tokens). They evaluate ~15 long‑context models (open-source and API) and find that many open models degrade or plateau as demonstrations grow; RNN-like long models (RWKV, Mamba) lag behind Transformers; top API models still outperform open models; and the most extreme task (Discovery, 174 labels) yields near‑zero accuracy for most LLMs (Gemini ≈14%), while a fine‑tuned BERT reaches 87%. They also show a strong position bias: grouped vs scattered example ordering can drop accuracy by 20–46% for some models. Practical takeaway: don’t rely on raw ICL for fin

Problem Statement

Current long‑context evaluations focus on perplexity, synthetic passkey tasks, or summarization—metrics that let models take shortcuts and don’t require reading and reasoning over the entire long prompt. The paper asks: can today’s long‑context LLMs perform in‑context learning when the demonstration alone is extremely long and contains many labels?

Main Contribution

Introduce LongICLBench: six extreme-label classification tasks designed for long in‑context learning (28–174 classes; 2K–50K token demos).

Comprehensive evaluation of ~15 long‑context LLMs (open and API) showing broad failures as difficulty increases.
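
To make the setup concrete, here is a minimal sketch of how a many-label ICL prompt of this kind could be assembled. It is an illustration only, not the benchmark's own code: the function name `build_icl_prompt`, the `Text:`/`Label:` template, and the whitespace-based token estimate are all assumptions.

```python
from collections import defaultdict

def build_icl_prompt(examples, query, shots_per_label=1, max_tokens=50_000):
    """Assemble a prompt with up to `shots_per_label` demos per label, then the query.

    `examples` is a list of (text, label) pairs. Token cost is a crude
    whitespace count (an assumption; a real tokenizer will differ).
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    demos, used = [], 0
    for label in sorted(by_label):
        for text in by_label[label][:shots_per_label]:
            demo = f"Text: {text}\nLabel: {label}\n"
            cost = len(demo.split())
            if used + cost > max_tokens:
                continue  # skip demos that would exceed the budget
            demos.append(demo)
            used += cost

    demos.append(f"Text: {query}\nLabel:")
    return "\n".join(demos), used
```

With 174 labels and even one demonstration per label, the prompt grows quickly, which is why the hardest tasks reach tens of thousands of tokens.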

Key Findings

On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.

Numbers: Discovery, most models 0%; Gemini 14%; fine-tuned BERT 87%

Practical Use: Do not expect reliable zero‑shot or few‑shot ICL for very large label spaces; use fine‑tuning, retrieval, or specialized classifiers for extreme-label tasks.

Evidence Ref: Table 6; main text (Section 3.3)

API models (e.g., GPT4‑turbo) consistently outperform most open‑source long‑context models across datasets; e.g., TacRED: GPT4‑turbo 84.2 F1 vs Mistral ∼42 F1.

Numbers: TacRED, GPT4‑turbo 84.2 vs Mistral 42.3 (Table 4)

Practical Use: If accuracy is critical and budget allows, prefer strong API models or fine‑tune; open models may need further adaptation to match performance.

Evidence Ref: Table 4; Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Discovery: most evaluated LLMs ≈0%; Gemini‑1.5‑Pro 14%	Fine‑tuned BERT 87%	−73 points vs BERT for Gemini‑1.5‑Pro (on Discovery)	Discovery (174 labels)	Table 6; Section 3.3	Table 6
F1	TacRED: GPT4‑turbo 84.2 F1; Mistral ~42.3 F1	SoTA (DeepStruct) ~76.8 F1 (task‑specific)	+7.4 vs SoTA for GPT4‑turbo; −34.5 vs SoTA for Mistral (−41.9 vs GPT4‑turbo)	TacRED	Table 4; Section 3.3	Table 4
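
The deltas are plain differences in percentage points between a system and its baseline. A short sketch for checking them, using only the scores quoted in the table (the dictionary keys and the `delta` helper are illustrative names, not from the paper):

```python
# Scores as reported in the Results table (percentage points).
scores = {
    "gemini_discovery_acc": 14.0,
    "bert_discovery_acc": 87.0,
    "gpt4_turbo_tacred_f1": 84.2,
    "mistral_tacred_f1": 42.3,
    "sota_tacred_f1": 76.8,
}

def delta(system, baseline):
    """Signed difference in percentage points, rounded to one decimal."""
    return round(scores[system] - scores[baseline], 1)

for system, baseline in [
    ("gemini_discovery_acc", "bert_discovery_acc"),
    ("gpt4_turbo_tacred_f1", "sota_tacred_f1"),
    ("mistral_tacred_f1", "sota_tacred_f1"),
    ("mistral_tacred_f1", "gpt4_turbo_tacred_f1"),
]:
    print(f"{system} vs {baseline}: {delta(system, baseline):+.1f}")
```

On these numbers, Mistral sits 34.5 points below the task-specific SoTA and 41.9 points below GPT4‑turbo, so it matters which baseline a delta is stated against.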

What To Try In 7 Days

Run LongICLBench on your model to measure real long‑prompt ICL performance.

Experiment with scattering vs grouping examples; randomize example order and measure sensitivity.

For extreme label sets, prototype a fine‑tuned classifier or RAG pipeline instead of pure ICL.
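
The ordering experiment above can be sketched as follows. This is a minimal harness under stated assumptions: demonstrations are `(text, label)` pairs, and `predict(demos, text)` is a hypothetical callable that wraps your model and returns a predicted label. None of these names come from the LongICLBench repository.

```python
import random

def grouped_order(examples):
    """Place same-label demonstrations next to each other (stable sort by label)."""
    return sorted(examples, key=lambda ex: ex[1])

def scattered_order(examples, seed=0):
    """Shuffle demonstrations so labels are spread through the prompt."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled

def accuracy(predict, demos, test_set):
    """Fraction of test items where predict(demos, text) matches the gold label."""
    correct = sum(predict(demos, text) == label for text, label in test_set)
    return correct / len(test_set)

def ordering_gap(predict, demos, test_set, seed=0):
    """Scattered accuracy minus grouped accuracy; a large positive gap suggests position bias."""
    scattered = accuracy(predict, scattered_order(demos, seed), test_set)
    grouped = accuracy(predict, grouped_order(demos), test_set)
    return scattered - grouped
```

Running `ordering_gap` over several seeds gives a rough sense of how sensitive a model is to example order before committing to a prompt layout.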

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: MIT (dataset); code repo linked

Code URLs

https://github.com/TIGER-AI-Lab/LongICLBench

Data URLs

https://github.com/TIGER-AI-Lab/LongICLBench

Risks & Boundaries

Limitations

LongICLBench focuses only on extreme‑label classification; other long‑context tasks may behave differently.

Evaluation mixes open‑source and closed API models, so performance gaps may reflect differences in architecture, instruction tuning, or API-side model updates rather than long-context ability alone.

When Not To Use

Do not treat LongICLBench results as a full proxy for summarization or retrieval performance.

Avoid using these ICL results to predict performance on generation or multimodal long tasks.

Failure Modes

Position bias: models favor labels in particular prompt positions.

Label-space scale: performance collapses as the label count grows past 100 (for example, Discovery with 174 labels).
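
One way to look for the position-bias failure mode is to bucket test items by where the gold label's demonstrations sit in the prompt and compare accuracy per bucket. The sketch below assumes you have already logged, for each test item, a relative position in [0, 1] and whether the prediction was correct; the record format and the function name `accuracy_by_position` are illustrative assumptions.

```python
from collections import defaultdict

def accuracy_by_position(records, num_buckets=4):
    """Accuracy per prompt-position bucket.

    `records` is an iterable of (relative_position, is_correct), where
    relative_position is 0.0 at the start of the prompt and 1.0 at the end.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, is_correct in records:
        # Clamp so that position == 1.0 falls into the last bucket.
        bucket = min(int(position * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```

A roughly flat profile across buckets suggests the model reads the whole prompt; a profile that rises or falls sharply toward one end points to position bias.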

Core Entities

Models

Gemma-7B-base, LLaMA-2-7B-32K, ChatGLM3-6B-32K, Qwen-1.5-7B-base, Mistral-7B-v0.2-base, LoRA, Yi-6B-200K, InternLM2-7B-base, Long-LLaMA-code-7B, RWKV-5-World, Mamba-2.8B, GPT4-turbo, GPT4o, Claude3-Opus, Gemini-1.5-Pro

Metrics

Accuracy, F1

Datasets

GoEmotions, BANKING77, TacRED, Few-NERD, DialogRE, Discovery

Benchmarks

LongICLBench
