Long-context LLMs fail to learn reliably from very long in‑context demonstrations

April 2, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

17

Authors

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Links

Abstract / PDF

Why It Matters For Business

If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.

Summary TLDR

The authors build LongICLBench, a six-task benchmark that stresses long in‑context learning (ICL) on extreme-label classification (28–174 classes; prompts of 2K–50K tokens). They evaluate ~15 long‑context models (open-source and API) and find that many open models degrade or plateau as demonstrations grow, RNN-like long models (RWKV, Mamba) lag behind Transformers, and top API models still outperform open models. The most extreme task (Discovery, 174 labels) yields near‑zero accuracy for most LLMs (Gemini ≈14%), while a fine‑tuned BERT reaches 87%. They also show a strong position bias: grouping same-label examples instead of scattering them can cut accuracy by 20–46% for some models. Practical takeaway: don’t rely on raw ICL for fine‑grained classification; fine‑tune or add retrieval/structured classifiers instead.

Problem Statement

Current long‑context evaluations focus on perplexity, synthetic passkey tasks, or summarization—metrics that let models take shortcuts and don’t require reading and reasoning over the entire long prompt. The paper asks: can today’s long‑context LLMs perform in‑context learning when the demonstration alone is extremely long and contains many labels?

Main Contribution

Introduce LongICLBench: six extreme-label classification tasks designed for long in‑context learning (28–174 classes; 2K–50K token demos).

Comprehensive evaluation of ~15 long‑context LLMs (open and API) showing broad failures as difficulty increases.

Analysis showing strong position bias and sensitivity to example ordering in long demonstrations.

Key Findings

On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.

Numbers: Discovery: most models 0%; Gemini 14%; BERT fine-tuned 87%

API models (e.g., GPT4‑turbo) consistently outperform most open‑source long‑context models across datasets; e.g., TacRED: GPT4‑turbo 84.2 F1 vs Mistral ∼42 F1.

Numbers: TacRED: GPT4‑turbo 84.2 vs Mistral 42.3 (Table 4)

Example ordering matters: grouping same‑label examples together, rather than scattering them through the prompt, can sharply reduce accuracy; on TacRED (3‑round), Mistral drops by 46.5% and GPT4‑turbo by 20.3%.

Numbers: TacRED 3R grouped ∆: Mistral −46.5%, GPT4‑turbo −20.3% (Table 10)
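
To make the grouped vs scattered comparison concrete, here is a minimal Python sketch of the two orderings. The function names and the toy demonstration pool are hypothetical, and the paper's exact construction may differ; the sketch only shows the idea of clustering same-label demonstrations versus spreading them through the prompt.

```python
import random
from collections import defaultdict

def grouped_order(demos):
    """Place all demonstrations that share a label next to each other."""
    by_label = defaultdict(list)
    for text, label in demos:
        by_label[label].append((text, label))
    # Labels are emitted in order of first appearance.
    return [d for label in by_label for d in by_label[label]]

def scattered_order(demos, seed=0):
    """Shuffle demonstrations so each label is spread across the prompt."""
    shuffled = list(demos)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def same_label_adjacency(demos):
    """Fraction of neighbouring pairs sharing a label (higher = more grouped)."""
    pairs = list(zip(demos, demos[1:]))
    if not pairs:
        return 0.0
    return sum(a[1] == b[1] for a, b in pairs) / len(pairs)

# Toy pool (assumption): 3 labels x 3 examples, initially interleaved.
demos = [(f"example {i} for {lab}", lab) for i in range(3) for lab in ("A", "B", "C")]
grouped = grouped_order(demos)
scattered = scattered_order(demos)
```

Feeding both orderings to the same model with the same query set isolates the effect of ordering from the effect of demonstration content.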

Adding demonstrations helps up to a point; for BANKING77 LLaMA‑2‑7B‑32K jumps from 30.2% (1R) to 70.4% (2R), but gains plateau or reverse past certain lengths.

Numbers: BANKING77 LLaMA‑2 1R→2R: 30.2%→70.4% (Table 3)
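
The round notation (1R, 2R, ...) can be sketched as follows, assuming a round means one demonstration per label, so prompt length grows roughly linearly with the number of rounds. The pool contents, the prompt template, and the whitespace word count used as a length proxy are all illustrative assumptions, not the benchmark's actual format.

```python
import random

def build_rounds(pool, num_rounds, seed=0):
    """Return demonstrations for `num_rounds` rounds.

    Each round holds one example per label, shuffled within the round.
    Assumes every label in `pool` has at least `num_rounds` examples.
    """
    rng = random.Random(seed)
    demos = []
    for r in range(num_rounds):
        round_demos = [(examples[r], label) for label, examples in pool.items()]
        rng.shuffle(round_demos)
        demos.extend(round_demos)
    return demos

def format_prompt(demos, query):
    """Render demonstrations plus the query in a simple Text/Label template."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)

# Toy pool (assumption): 77 labels, mirroring BANKING77's label count only.
pool = {
    f"intent_{i}": [f"sample utterance {j} for intent {i}" for j in range(5)]
    for i in range(77)
}

for rounds in (1, 2, 5):
    prompt = format_prompt(build_rounds(pool, rounds), "where is my card?")
    # Whitespace word count is only a rough stand-in for model tokens.
    print(rounds, "rounds ->", len(prompt.split()), "words")
```

Sweeping the number of rounds this way is one simple approach to locating where added demonstrations stop helping for a given model.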

Architecture matters: Transformer‑based open models generally beat attention‑free or RNN‑like long models (RWKV, Mamba) on the evaluated tasks.

Numbers: Multiple tables: RWKV/Mamba near 0–10% vs Transformer models often 20–80%

Results

Accuracy

Value: Discovery: most evaluated LLMs ≈0%; Gemini‑1.5‑Pro 14%

Baseline: Fine‑tuned BERT 87%

F1

Value: TacRED: GPT4‑turbo 84.2 F1; Mistral ~42.3 F1

Baseline: SoTA (DeepStruct) ~76.8 F1 (task‑specific)

Accuracy

Value: BANKING77: GPT4‑turbo 84.4%; LLaMA‑2‑7B‑32K 77.2% (5R)

Baseline: SoTA (RoBERTa+ICDA) 94.4%

Accuracy

Value: TacRED (3R): Mistral −46.5% when grouped; GPT4‑turbo −20.3%

Baseline: Scattered ordering

Who Should Care

What To Try In 7 Days

Run LongICLBench on your model to measure real long‑prompt ICL performance.

Experiment with scattering vs grouping examples; randomize example order and measure sensitivity.

For extreme label sets, prototype a fine‑tuned classifier or RAG pipeline instead of pure ICL.
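
One way to run the order-sensitivity experiment is a small harness like the sketch below. Here `predict` is a hypothetical callable that wraps whatever model you are testing (it receives the ordered demonstrations and a query and returns a label); the stub model and the toy data are placeholders so the example runs on its own.

```python
import random
import statistics

def order_sensitivity(predict, demos, eval_set, num_orders=5, seed=0):
    """Score `predict` under several random demonstration orders.

    predict:  callable(demos, text) -> label (hypothetical model wrapper)
    demos:    list of (text, label) demonstrations
    eval_set: list of (text, gold_label) queries
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(num_orders):
        order = list(demos)
        rng.shuffle(order)
        correct = sum(predict(order, text) == gold for text, gold in eval_set)
        accuracies.append(correct / len(eval_set))
    return {
        "mean": statistics.mean(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
        "spread": max(accuracies) - min(accuracies),
    }

def last_label_stub(demos, text):
    """Stand-in model with extreme recency bias: echoes the final demo's label."""
    return demos[-1][1]

# Toy data (assumption) so the harness can be exercised without a real model.
demos = [("refund please", "refund"), ("card lost", "lost_card"), ("hi", "greeting")]
eval_set = [("I lost my card", "lost_card"), ("hello", "greeting")]
report = order_sensitivity(last_label_stub, demos, eval_set)
print(report)
```

A large spread between the best and worst ordering suggests the model's predictions depend on where examples sit rather than on what they say.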

Reproducibility

License

  • MIT (dataset); code repo linked

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LongICLBench focuses only on extreme‑label classification; other long‑context tasks may behave differently.
  • Evaluation mixes open‑source and closed API models, so differences can reflect both architecture and instruction tuning or downstream updates.
  • Some models show zero scores or are marked as unsupported in the tables; results depend on each model's support for long windows and on prompt formatting.

When Not To Use

  • Do not treat LongICLBench results as a full proxy for summarization or retrieval performance.
  • Avoid using these ICL results to predict performance on generation or multimodal long tasks.

Failure Modes

  • Position bias: models favor labels in particular prompt positions.
  • Label-space scale: performance collapses when the label count grows large (e.g., Discovery's 174 labels).
  • Demonstration length: adding more shots can plateau or hurt performance beyond a sweet spot.
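
Position bias can be checked with a simple diagnostic: record where in the prompt the predicted label's demonstration sits and count predictions per prompt region. The sketch below is an illustrative assumption rather than the paper's analysis; the function names and the three-bucket split are hypothetical.

```python
from collections import Counter

def position_bucket(demos, label, num_buckets=3):
    """Bucket (0 = start of prompt) of the last demonstration of `label`.

    Returns None when the label never appears among the demonstrations.
    """
    positions = [i for i, (_, demo_label) in enumerate(demos) if demo_label == label]
    if not positions:
        return None
    return min(positions[-1] * num_buckets // len(demos), num_buckets - 1)

def predicted_position_histogram(runs, num_buckets=3):
    """Count predictions per prompt bucket.

    runs: iterable of (demos, predicted_label) pairs, one per query.
    """
    hist = Counter()
    for demos, predicted in runs:
        hist[position_bucket(demos, predicted, num_buckets)] += 1
    return hist

# Toy run (assumption): six single-example labels and three predictions.
demos = [(f"text {i}", lab) for i, lab in enumerate("ABCDEF")]
hist = predicted_position_histogram([(demos, "F"), (demos, "F"), (demos, "A")])
print(hist)
```

If gold labels are spread evenly over the prompt but predictions pile up in one bucket, that skew is a sign of position bias rather than genuine use of the demonstrations.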

Core Entities

Models

  • Gemma-7B-base
  • LLaMA-2-7B-32K
  • ChatGLM3-6B-32K
  • Qwen-1.5-7B-base
  • Mistral-7B-v0.2-base
  • LoRA
  • Yi-6B-200K
  • InternLM2-7B-base
  • Long-LLaMA-code-7B
  • RWKV-5-World
  • Mamba-2.8B
  • GPT4-turbo
  • GPT4o
  • Claude3-Opus
  • Gemini-1.5-Pro

Metrics

  • Accuracy
  • F1

Datasets

  • GoEmotions
  • BANKING77
  • TacRED
  • Few-NERD
  • DialogRE
  • Discovery

Benchmarks

  • LongICLBench