Overview
The paper offers a focused, reproducible benchmark and clear failure cases; results are strong for the claimed task but the benchmark covers one evaluation type only.
Citations: 17
Evidence Strength: 0.70
Confidence: 0.86
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
License: MIT (dataset); code repo linked
At A Glance
Cost impact: 40%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.
Who Should Care
Summary TLDR
The authors build LongICLBench, a 6-task benchmark that stresses long in‑context learning (ICL) on extreme-label classification (28–174 classes; prompts 2K–50K tokens). They evaluate ~15 long‑context models (open-source and API) and find: many open models degrade or plateau as demonstrations grow; RNN-like long models (RWKV, Mamba) lag behind Transformers; top API models still outperform open models; extreme tasks (Discovery, 174 labels) yield near‑zero accuracy for most LLMs (Gemini ≈14%), while a fine‑tuned BERT gets 87%. They also show a strong position bias: grouped vs scattered example ordering can drop accuracy by 20–46% for some models. Practical takeaway: don’t rely on raw ICL for fin
Problem Statement
Current long‑context evaluations focus on perplexity, synthetic passkey retrieval, or summarization. These evaluations let models take shortcuts and do not require reading and reasoning over the entire long prompt. The paper asks: can today’s long‑context LLMs perform in‑context learning when the demonstrations alone are extremely long and cover many labels?
Main Contribution
Introduce LongICLBench: six extreme-label classification tasks designed for long in‑context learning (28–174 classes; 2K–50K token demos).
Comprehensive evaluation of ~15 long‑context LLMs (open and API) showing broad failures as difficulty increases.
Key Findings
On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.
API models (e.g., GPT4‑turbo) consistently outperform most open‑source long‑context models across datasets; e.g., TacRED: GPT4‑turbo 84.2 F1 vs Mistral ∼42 F1.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Discovery: most evaluated LLMs ≈0%; Gemini‑1.5‑Pro 14% | Fine‑tuned BERT 87% | −73 points vs BERT for Gemini‑1.5‑Pro | Discovery (174 labels) | Table 6; Section 3.3 | Table 6 |
| F1 | TacRED: GPT4‑turbo 84.2 F1; Mistral ~42.3 F1 | SoTA (DeepStruct) ~76.8 F1 (task‑specific) | +7.4 vs SoTA for GPT4‑turbo; −34.5 vs SoTA for Mistral | TacRED | Table 4; Section 3.3 | Table 4 |
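
The Delta column is plain subtraction in percentage or F1 points. As a quick sanity check, a minimal sketch recomputes the gaps from the values reported in the table. The dictionary keys and the helper name `delta_points` are illustrative only, and the Mistral and DeepStruct figures are the approximate values given above.

```python
def delta_points(value, baseline):
    """Difference between two scores in points, rounded to one decimal."""
    return round(value - baseline, 1)

# (value, baseline) pairs copied from the Results table; "~" values are approximate.
reported = {
    "Discovery accuracy: Gemini-1.5-Pro vs fine-tuned BERT": (14.0, 87.0),
    "TacRED F1: GPT4-turbo vs SoTA (DeepStruct)": (84.2, 76.8),
    "TacRED F1: Mistral vs SoTA (DeepStruct)": (42.3, 76.8),
    "TacRED F1: Mistral vs GPT4-turbo": (42.3, 84.2),
}

deltas = {name: delta_points(v, b) for name, (v, b) in reported.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
```

The last entry is not in the table; it is included only to show that the gap between Mistral and GPT4‑turbo is larger than the gap between Mistral and the task‑specific SoTA.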
What To Try In 7 Days
Run LongICLBench on your model to measure real long‑prompt ICL performance.
Experiment with scattering vs grouping examples; randomize example order and measure sensitivity.
For extreme label sets, prototype a fine‑tuned classifier or RAG pipeline instead of pure ICL.
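
The ordering experiment in the list above can be prototyped in a few lines. This is a minimal sketch, not the paper's evaluation harness: `predict` is a hypothetical callable that wraps your model and returns a label string for a prompt, and the `Text:` / `Label:` prompt template is an assumption chosen for illustration.

```python
import random
from collections import defaultdict

def grouped_order(demos):
    """Place all demonstrations that share a label next to each other."""
    by_label = defaultdict(list)
    for text, label in demos:
        by_label[label].append((text, label))
    return [d for label in sorted(by_label) for d in by_label[label]]

def scattered_order(demos, seed=0):
    """Shuffle demonstrations so each label is spread across the prompt."""
    shuffled = list(demos)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def build_prompt(demos, query):
    """Assumed template: one Text/Label pair per demonstration, then the query."""
    blocks = [f"Text: {t}\nLabel: {l}" for t, l in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)

def accuracy(predict, demos, test_set):
    """Fraction of test queries for which predict() returns the gold label."""
    correct = sum(predict(build_prompt(demos, q)) == gold for q, gold in test_set)
    return correct / len(test_set)

def order_sensitivity(predict, demos, test_set, seeds=(0, 1, 2)):
    """Compare grouped ordering against the mean over several random orderings."""
    grouped = accuracy(predict, grouped_order(demos), test_set)
    scattered = [accuracy(predict, scattered_order(demos, s), test_set) for s in seeds]
    scattered_mean = sum(scattered) / len(scattered)
    return {
        "grouped": grouped,
        "scattered_mean": scattered_mean,
        "grouped_minus_scattered": grouped - scattered_mean,
    }
```

A strongly negative `grouped_minus_scattered` would suggest the model is sensitive to example grouping, in line with the position bias the summary reports. Averaging over several seeds keeps a single lucky shuffle from hiding the effect.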
Risks & Boundaries
Limitations
LongICLBench focuses only on extreme‑label classification; other long‑context tasks may behave differently.
Evaluation mixes open‑source and closed API models, so observed differences may reflect architecture, instruction tuning, or later model updates rather than any single factor.
When Not To Use
Do not treat LongICLBench results as a full proxy for summarization or retrieval performance.
Avoid using these ICL results to predict performance on generation or multimodal long tasks.
Failure Modes
Position bias: models favor labels in particular prompt positions.
Label‑space scale: performance collapses when the label count grows well beyond 100 (e.g., Discovery with 174 labels).
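
One way to check for the position bias listed above on your own data is to bucket test accuracy by where the gold label's demonstrations sit in the prompt. This is a rough diagnostic sketch, not the paper's procedure: the function names, the use of the mean demonstration index, and the default of four buckets are all assumptions.

```python
def relative_position(demos, gold_label):
    """Mean index of the gold label's demonstrations, scaled to [0, 1].

    Returns None when the label never appears among the demonstrations.
    """
    idx = [i for i, (_, label) in enumerate(demos) if label == gold_label]
    if not idx:
        return None
    return (sum(idx) / len(idx)) / max(len(demos) - 1, 1)

def accuracy_by_position(records, n_buckets=4):
    """Accuracy per prompt-position bucket.

    records: iterable of (relative_position, is_correct) pairs, with
    relative_position in [0, 1]. Empty buckets are reported as None.
    """
    totals = [0] * n_buckets
    hits = [0] * n_buckets
    for pos, ok in records:
        bucket = min(int(pos * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

If accuracy is much higher in the last bucket than in the first, the model is likely favoring labels whose examples appear near the end of the prompt.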

