Long-context LLMs fail to learn reliably from very long in‑context demonstrations

April 2, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper offers a focused, reproducible benchmark and clear failure cases; results are strong for the claimed task but the benchmark covers one evaluation type only.

Citations: 17

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: MIT (dataset); code repo linked

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 50%

Authors

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.

Summary TLDR

The authors build LongICLBench, a 6-task benchmark that stresses long in‑context learning (ICL) on extreme-label classification (28–174 classes; prompts 2K–50K tokens). They evaluate ~15 long‑context models (open-source and API) and find: many open models degrade or plateau as demonstrations grow; RNN-like long models (RWKV, Mamba) lag behind Transformers; top API models still outperform open models; extreme tasks (Discovery, 174 labels) yield near‑zero accuracy for most LLMs (Gemini ≈14%), while a fine‑tuned BERT gets 87%. They also show a strong position bias: grouped vs scattered example ordering can drop accuracy by 20–46% for some models. Practical takeaway: don’t rely on raw ICL for fin

Problem Statement

Current long‑context evaluations focus on perplexity, synthetic passkey tasks, or summarization—metrics that let models take shortcuts and don’t require reading and reasoning over the entire long prompt. The paper asks: can today’s long‑context LLMs perform in‑context learning when the demonstration alone is extremely long and contains many labels?

Main Contribution

Introduce LongICLBench: six extreme-label classification tasks designed for long in‑context learning (28–174 classes; 2K–50K token demos).

Comprehensive evaluation of ~15 long‑context LLMs (open and API) showing broad failures as difficulty increases.

Key Findings

On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.

Numbers: Discovery: most models ≈0%; Gemini 14%; fine-tuned BERT 87%

Practical Use: Do not expect reliable zero‑shot or few‑shot ICL for very large label spaces; use fine‑tuning, retrieval, or specialized classifiers for extreme-label tasks.

Evidence Ref: Table 6; main text (Section 3.3)
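One way to read the "retrieval" suggestion above is to stop packing every label into the prompt and instead pick the few labeled examples closest to the query. The following is a minimal sketch under that assumption, not a method from the paper: the helper names, the token-overlap (Jaccard) similarity, and the toy label names are all illustrative choices.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_demos(query, pool, k=3):
    """Return the k (text, label) pairs from `pool` most similar to `query`."""
    q = tokens(query)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)
    return ranked[:k]

def majority_label(demos):
    """Cheap baseline: most common label among the retrieved demonstrations."""
    return Counter(label for _, label in demos).most_common(1)[0][0]

# Hypothetical labeled pool; the label names are made up for illustration.
pool = [
    ("card was declined at the store", "card_declined"),
    ("how do I reset my pin", "change_pin"),
    ("my card got declined again", "card_declined"),
    ("what is the exchange rate", "exchange_rate"),
]
nearest = retrieve_demos("why was my card declined", pool, k=2)
print(majority_label(nearest))
```

The retrieved examples can either be voted on directly, as here, or placed in a much shorter prompt for the LLM. A real system would likely swap the token overlap for an embedding model.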

API models (e.g., GPT4‑turbo) consistently outperform most open‑source long‑context models across datasets; e.g., TacRED: GPT4‑turbo 84.2 F1 vs Mistral ∼42 F1.

Numbers: TacRED: GPT4‑turbo 84.2 vs Mistral 42.3 (Table 4)

Practical Use: If accuracy is critical and budget allows, prefer strong API models or fine‑tune; open models may need further adaptation to match performance.

Evidence Ref: Table 4; Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | Discovery: most evaluated LLMs ≈0%; Gemini‑1.5‑Pro 14% | Fine‑tuned BERT 87% | −73 points vs BERT for Gemini‑1.5‑Pro (on Discovery) | Discovery (174 labels) | Table 6; Section 3.3 | Table 6 |
| F1 | TacRED: GPT4‑turbo 84.2 F1; Mistral ~42.3 F1 | SoTA (DeepStruct) ~76.8 F1 (task‑specific) | +7.4 vs SoTA for GPT4‑turbo; −34.5 vs SoTA for Mistral (−41.9 vs GPT4‑turbo) | TacRED | Table 4; Section 3.3 | Table 4 |
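The deltas in the rows above are plain differences between the reported scores. As a quick sanity check, here is a minimal Python sketch; the dictionary keys are illustrative labels, and the numbers are the ones quoted in this summary.

```python
# Scores as quoted in this summary (TacRED F1, Discovery accuracy in %).
tacred_f1 = {"GPT4-turbo": 84.2, "Mistral": 42.3, "DeepStruct": 76.8}
discovery_acc = {"Gemini-1.5-Pro": 14.0, "BERT (fine-tuned)": 87.0}

def delta(scores, model, baseline):
    """Signed difference in points between a model and a baseline."""
    return round(scores[model] - scores[baseline], 1)

print(delta(tacred_f1, "GPT4-turbo", "DeepStruct"))                   # 7.4
print(delta(tacred_f1, "Mistral", "DeepStruct"))                      # -34.5
print(delta(tacred_f1, "Mistral", "GPT4-turbo"))                      # -41.9
print(delta(discovery_acc, "Gemini-1.5-Pro", "BERT (fine-tuned)"))    # -73.0
```

Note that −34.5 is Mistral's gap to the task‑specific SoTA; its gap to GPT4‑turbo is −41.9.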

What To Try In 7 Days

Run LongICLBench on your model to measure real long‑prompt ICL performance.

Experiment with scattering vs grouping examples; randomize example order and measure sensitivity.

For extreme label sets, prototype a fine‑tuned classifier or RAG pipeline instead of pure ICL.
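The grouped-versus-scattered experiment in the list above can be set up with a few helpers. This is a rough sketch, not the benchmark's own harness: the prompt template, the function names, and the `predict` callable (any function mapping a prompt string to a label string) are assumptions to adapt to your model.

```python
import random
from collections import defaultdict

def grouped_order(examples):
    """Place all demonstrations that share a label next to each other."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    ordered = []
    for label in by_label:  # dicts keep first-seen label order
        ordered.extend(by_label[label])
    return ordered

def scattered_order(examples, seed=0):
    """Shuffle demonstrations so labels are spread across the prompt."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def build_prompt(examples, query):
    """Hypothetical plain-text template: demonstrations, then the query."""
    demos = "\n".join(f"Text: {t}\nLabel: {l}\n" for t, l in examples)
    return f"{demos}\nText: {query}\nLabel:"

def accuracy(predict, examples, test_set):
    """Fraction of (query, gold) pairs where predict(prompt) equals gold."""
    hits = sum(predict(build_prompt(examples, q)) == gold for q, gold in test_set)
    return hits / len(test_set)

def order_sensitivity(predict, examples, test_set, seeds=(0, 1, 2)):
    """Accuracy with grouped demos vs. mean accuracy over shuffled orders."""
    grouped = accuracy(predict, grouped_order(examples), test_set)
    scattered = [accuracy(predict, scattered_order(examples, s), test_set) for s in seeds]
    return grouped, sum(scattered) / len(scattered)
```

Calling `order_sensitivity(my_model_fn, demos, test_set)` returns the two accuracies; a large gap between them is the ordering sensitivity to watch for. Averaging over several seeds keeps a single lucky shuffle from hiding the effect.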

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: MIT (dataset); code repo linked

Risks & Boundaries

Limitations

LongICLBench focuses only on extreme‑label classification; other long‑context tasks may behave differently.

Evaluation mixes open‑source and closed API models, so differences can reflect both architecture and instruction tuning or downstream updates.

When Not To Use

Do not treat LongICLBench results as a full proxy for summarization or retrieval performance.

Avoid using these ICL results to predict performance on generation or multimodal long tasks.

Failure Modes

Position bias: models favor labels in particular prompt positions.

Label-space scale: performance collapses as the label count grows well beyond 100 (e.g., Discovery with 174 labels).
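To check whether a model shows the position bias listed above, one option is to bucket test items by where their gold label's demonstrations sit in the prompt and compare accuracy across buckets. The sketch below is an assumed diagnostic, not the paper's analysis code; the function names and the binning scheme are illustrative.

```python
from collections import defaultdict

def label_positions(demo_labels):
    """Mean relative position (0 = prompt start, 1 = prompt end) of each label's demos."""
    indices = defaultdict(list)
    for i, label in enumerate(demo_labels):
        indices[label].append(i)
    last = max(len(demo_labels) - 1, 1)  # avoid division by zero for a single demo
    return {label: sum(v) / len(v) / last for label, v in indices.items()}

def accuracy_by_position(demo_labels, results, n_bins=4):
    """results: iterable of (gold_label, predicted_label). Returns {bin_index: accuracy}."""
    positions = label_positions(demo_labels)
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in results:
        if gold not in positions:
            continue  # gold label never appeared in the prompt
        bin_index = min(int(positions[gold] * n_bins), n_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(pred == gold)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Toy run: label "A" is demonstrated early in the prompt, "B" late.
demo_labels = ["A", "A", "B", "B"]
results = [("A", "B"), ("B", "B"), ("A", "A"), ("B", "B")]
print(accuracy_by_position(demo_labels, results, n_bins=2))  # {0: 0.5, 1: 1.0}
```

In the toy run, labels demonstrated near the end of the prompt score higher than those near the start. A flat profile across bins would suggest little position bias.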

Core Entities

Models

Gemma-7B-base, LLaMA-2-7B-32K, ChatGLM3-6B-32K, Qwen-1.5-7B-base, Mistral-7B-v0.2-base, LoRA, Yi-6B-200K, InternLM2-7B-base, Long-LLaMA-code-7B, RWKV-5-World, Mamba-2.8B, GPT4-turbo, GPT4o, Claude3-Opus, Gemini-1.5-Pro

Metrics

Accuracy, F1

Datasets

GoEmotions, BANKING77, TacRED, Few-NERD, DialogRE, Discovery

Benchmarks

LongICLBench