Overview
Production Readiness: 0.4
Novelty Score: 0.5
Cost Impact Score: 0.4
Citation Count: 17
Why It Matters For Business
If you rely on few-shot prompting for fine-grained classification over large label sets, current long-context LLMs are unreliable: accuracy degrades as prompts grow and is sensitive to example order. Plan to fine-tune, or add retrieval or structured classifiers, instead.
Summary TLDR
The authors build LongICLBench, a 6-task benchmark that stresses long in‑context learning (ICL) on extreme-label classification (28–174 classes; prompts 2K–50K tokens). They evaluate ~15 long‑context models (open-source and API) and find: many open models degrade or plateau as demonstrations grow; RNN-like long models (RWKV, Mamba) lag behind Transformers; top API models still outperform open models; extreme tasks (Discovery, 174 labels) yield near‑zero accuracy for most LLMs (Gemini ≈14%), while a fine‑tuned BERT gets 87%. They also show a strong position bias: grouped vs scattered example ordering can drop accuracy by 20–46% for some models. Practical takeaway: don’t rely on raw ICL for fin
Problem Statement
Current long-context evaluations focus on perplexity, synthetic passkey tasks, or summarization. These metrics let models take shortcuts and do not require reading and reasoning over the entire long prompt. The paper asks: can today's long-context LLMs perform in-context learning when the demonstrations alone are extremely long and span many labels?
Main Contribution
- Introduces LongICLBench: six extreme-label classification tasks designed for long in-context learning (28–174 classes; 2K–50K token demonstrations).
- Evaluates ~15 long-context LLMs (open-source and API) and shows broad failures as task difficulty increases.
- Analyzes position bias and shows that models are highly sensitive to example ordering in long demonstrations.
Key Findings
- On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score near 0% accuracy; Gemini-1.5-Pro achieves 14%, while a fine-tuned BERT reaches 87%.
- API models (e.g., GPT4-turbo) consistently outperform most open-source long-context models across datasets. On TacRED, GPT4-turbo reaches 84.2 F1 versus ~42 F1 for Mistral.
- Example ordering matters: grouping same-label examples together can sharply reduce accuracy. On TacRED (3-round), Mistral drops by 46.5% and GPT4-turbo by 20.3%.
- Adding demonstrations helps up to a point: on BANKING77, LLaMA-2-7B-32K jumps from 30.2% with one round (1R) to 70.4% with two rounds (2R), but gains plateau or reverse beyond certain lengths.
- Architecture matters: Transformer-based open models generally beat attention-free or RNN-like long models (RWKV, Mamba) on the evaluated tasks.
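The round-based setup behind figures such as 1R, 2R, and 3-round can be sketched as follows. This is a minimal illustration, not the benchmark's own code: it assumes a round contains exactly one demonstration per label, and the helper names `build_rounds` and `render_prompt` and the `Text:`/`Label:` template are hypothetical.

```python
import random
from collections import defaultdict


def build_rounds(examples, num_rounds, seed=0):
    """Return demonstrations arranged in `num_rounds` rounds.

    `examples` is a list of (text, label) pairs. Each round holds one
    example per label (an assumption about the setup), shuffled within
    the round so same-label examples do not cluster.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    for label, texts in by_label.items():
        if len(texts) < num_rounds:
            raise ValueError(f"label {label!r} has fewer than {num_rounds} examples")
        rng.shuffle(texts)

    demos = []
    for r in range(num_rounds):
        round_demos = [(by_label[label][r], label) for label in sorted(by_label)]
        rng.shuffle(round_demos)
        demos.extend(round_demos)
    return demos


def render_prompt(demos, query):
    """Render demonstrations plus the query using a hypothetical template."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)
```

Prompt length grows roughly linearly with rounds times label count, which is why tasks with many labels reach tens of thousands of tokens after only a few rounds.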
Who Should Care
What To Try In 7 Days
- Run LongICLBench on your model to measure real long-prompt ICL performance.
- Experiment with scattering vs grouping examples; randomize example order and measure sensitivity.
- For extreme label sets, prototype a fine-tuned classifier or RAG pipeline instead of pure ICL.
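The ordering experiment suggested above can be sketched in a few lines. This is a rough harness under stated assumptions: `predict(demos, text)` is a hypothetical callable wrapping your own model, and "grouped" is approximated by sorting demonstrations by label.

```python
import random


def grouped_order(demos):
    """Place same-label demonstrations next to each other (stable sort by label)."""
    return sorted(demos, key=lambda d: d[1])


def scattered_order(demos, seed=0):
    """Return a shuffled copy so labels are spread across the prompt."""
    rng = random.Random(seed)
    shuffled = list(demos)
    rng.shuffle(shuffled)
    return shuffled


def order_sensitivity(demos, test_set, predict, seeds=(0, 1, 2)):
    """Compare accuracy under grouped vs. scattered demonstration order.

    `predict(ordered_demos, text)` is a hypothetical callable that returns
    a label; `test_set` is a list of (text, gold_label) pairs.
    """
    def accuracy(ordered):
        correct = sum(predict(ordered, text) == gold for text, gold in test_set)
        return correct / len(test_set)

    scattered = [accuracy(scattered_order(demos, s)) for s in seeds]
    return {
        "grouped": accuracy(grouped_order(demos)),
        "scattered_mean": sum(scattered) / len(scattered),
        "scattered_min": min(scattered),
        "scattered_max": max(scattered),
    }
```

A large gap between `grouped` and `scattered_mean`, or a wide spread between `scattered_min` and `scattered_max`, suggests the model is sensitive to example order on your data.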
Reproducibility
License
- MIT (dataset); code repo linked
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LongICLBench focuses only on extreme‑label classification; other long‑context tasks may behave differently.
- Evaluation mixes open‑source and closed API models, so differences can reflect both architecture and instruction tuning or downstream updates.
- Some models show zero scores or missing support in the result tables; results depend on each model's support for long windows and on prompt formatting.
When Not To Use
- Do not treat LongICLBench results as a full proxy for summarization or retrieval performance.
- Avoid using these ICL results to predict performance on generation or multimodal long tasks.
Failure Modes
- Position bias: models favor labels in particular prompt positions.
- Label-space scale: performance collapses when the label count grows large (well over 100, e.g. the 174 labels in Discovery).
- Demonstration length: adding more shots can plateau or hurt performance beyond a sweet spot.
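One way to check for the position-bias failure mode on your own runs is to bucket accuracy by where the gold label's demonstration sat in the prompt. The sketch below is illustrative only; the record format and the function name `accuracy_by_position` are assumptions, not part of the benchmark.

```python
def accuracy_by_position(records, num_buckets=4):
    """Accuracy per prompt-position bucket.

    `records` is an iterable of (relative_position, correct) pairs, where
    relative_position in [0, 1] is the location of the gold label's
    demonstration in the prompt (0 = start, 1 = end). Returns one accuracy
    per bucket, or None for buckets that received no records.
    """
    hits = [0] * num_buckets
    totals = [0] * num_buckets
    for position, correct in records:
        if not 0.0 <= position <= 1.0:
            raise ValueError(f"position out of range: {position}")
        # Clamp so that position == 1.0 falls into the last bucket.
        bucket = min(int(position * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(bool(correct))
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Accuracy that is much higher in the last bucket than in the first would be consistent with a model favoring labels that appear near the end of the prompt.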
Core Entities
Models
- Gemma-7B-base
- LLaMA-2-7B-32K
- ChatGLM3-6B-32K
- Qwen-1.5-7B-base
- Mistral-7B-v0.2-base
- LoRA
- Yi-6B-200K
- InternLM2-7B-base
- Long-LLaMA-code-7B
- RWKV-5-World
- Mamba-2.8B
- GPT4-turbo
- GPT4o
- Claude3-Opus
- Gemini-1.5-Pro
Metrics
- Accuracy
- F1
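For reference, the two metrics can be computed as in the sketch below. The summary does not state which F1 variant the benchmark uses, so micro-averaged F1 with an optional excluded negative class (a common choice for relation extraction) is an assumption here, as is the `negative_label` parameter.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold, pred, negative_label=None):
    """Micro-averaged F1, optionally ignoring a negative class.

    With `negative_label` set (e.g. a "no relation" class), only
    non-negative labels count as positives. With the default None,
    every label counts and the score equals plain accuracy.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be the same length")
    true_pos = sum(g == p and g != negative_label for g, p in zip(gold, pred))
    pred_pos = sum(p != negative_label for p in pred)
    gold_pos = sum(g != negative_label for g in gold)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```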
Datasets
- GoEmotions
- BANKING77
- TacRED
- Few-NERD
- DialogRE
- Discovery
Benchmarks
- LongICLBench

