DrICL: tune objectives and reweight noisy demonstrations to stabilize many-shot in‑context learning

January 7, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

0

Authors

Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan

Why It Matters For Business

When LLMs are given hundreds of in-context examples, performance can fall. DrICL stabilizes many-shot behavior and reduces variability across tasks, so production systems that batch many examples (search reranking, large retrieval contexts, document clustering) get more predictable results.

Summary TLDR

Large language models can get worse when you feed them many in-context examples. The paper diagnoses two causes: (1) training with plain negative log-likelihood (NLL) does not favor many-shot over zero-shot, and (2) adding many demonstrations increases noisy, high-loss examples. The authors propose DrICL: (a) a global 'differentiated learning' objective that forces many-shot loss below zero-shot loss, and (b) a local advantage-based reweighting that downweights noisy demonstrations using cumulative advantage (an RL-inspired reward). They release ICL-50, a 50-task many-shot dataset, and show DrICL reduces performance variance and yields more stable or better accuracy across many-shot ranges (

Problem Statement

When you increase the number of in-context examples (k) into the hundreds, LLM performance often stops improving and can decline. Two practical drivers are: the standard NLL training objective does not optimize the trade-off between zero-shot and many-shot, and many-shot contexts accumulate noisy or harmful demonstrations that destabilize learning.

Main Contribution

DrICL: a training method that combines a global differentiated objective (encouraging many-shot loss < zero-shot loss) with a local advantage-based reweighting of demonstrations.

An advantage-based reweighting algorithm that samples a preceding window, computes a cumulative advantage from loss differences, and multiplies many-shot NLL by that advantage.

ICL-50: a large many-shot benchmark of 50 tasks (7 task types, token lengths 10–14k, up to hundreds of thousands of samples) released with code and data.
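The two components above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact formulas, so the `(1 + alpha)` / `(1 - alpha)` weighting, the exponential form of the advantage, and every function name here are hypothetical. Only the overall shape (sample from a preceding window, accumulate loss differences, multiply the many-shot NLL by the result) follows the description.

```python
import math
import random


def differentiated_loss(many_shot_nll, zero_shot_nll, alpha=0.3):
    """Global objective (ASSUMED form): weight many-shot NLL above zero-shot NLL."""
    return (1.0 + alpha) * many_shot_nll + (1.0 - alpha) * zero_shot_nll


def cumulative_advantage(window_losses, current_loss, gamma=11.0,
                         sample_size=1, rng=None):
    """Local weight (ASSUMED form) from loss differences against a preceding window.

    Returns a value below 1 when the current block's loss is higher than the
    sampled earlier losses (a likely noisy block) and above 1 when it is lower.
    """
    rng = rng or random.Random(0)
    k = min(sample_size, len(window_losses))
    sampled = rng.sample(window_losses, k)  # empty list when there is no history
    total_gain = sum(prev - current_loss for prev in sampled)
    return math.exp(total_gain / gamma)


def reweighted_many_shot_nll(block_losses, window=10, gamma=11.0,
                             sample_size=1, seed=0):
    """Multiply each block's NLL by its advantage weight, then average."""
    rng = random.Random(seed)
    weighted = []
    for i, loss in enumerate(block_losses):
        preceding = block_losses[max(0, i - window):i]  # the preceding window
        weight = cumulative_advantage(preceding, loss, gamma, sample_size, rng)
        weighted.append(weight * loss)
    return sum(weighted) / len(weighted)
```

Under these assumptions, a block whose loss jumps well above its neighbors receives a weight below 1, so it contributes less to the many-shot term that is then passed to `differentiated_loss`.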

Key Findings

DrICL yields lower cross-dataset performance variance than baselines.

Numbers: variance avg DrICL=1.56e-03 vs MetaICL=2.38e-03 (Table 7)

DrICL improves reasoning accuracy on GSM8K versus baselines.

Numbers: GSM8K average accuracy DrICL=0.29 vs MetaICL=0.25 (Table 5)

DrICL stabilizes and often improves clustering and retrieval at high k values.

Numbers: CLSClusteringS2S (Mistral): DrICL AVG=0.84, MAX=0.88 vs MetaICL AVG=0.76 (Table 4)

Results

Performance variance (across datasets)

Value: 1.56e-03 (DrICL average)

Baseline: 2.38e-03 (MetaICL average)

Accuracy (GSM8K)

Value: avg 0.29 (DrICL)

Baseline: avg 0.25 (MetaICL)

Accuracy (CLSClusteringS2S, Mistral)

Value: avg 0.84, max 0.88 (DrICL)

Baseline: avg 0.76, max 0.82 (MetaICL)
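To put the reported figures on a common scale, the relative change of each DrICL value against its MetaICL baseline can be computed directly. The snippet below uses only the numbers quoted in this summary; the helper name `relative_change` is illustrative, and the percentages are simple arithmetic rather than results reported by the paper.

```python
def relative_change(value, baseline):
    """Relative change of value with respect to baseline (negative means lower)."""
    return (value - baseline) / baseline


# (DrICL, MetaICL) pairs as quoted in this summary.
reported = {
    "variance (Table 7)": (1.56e-03, 2.38e-03),
    "GSM8K accuracy (Table 5)": (0.29, 0.25),
    "CLSClusteringS2S avg (Table 4)": (0.84, 0.76),
}

for name, (dricl, metaicl) in reported.items():
    print(f"{name}: {relative_change(dricl, metaicl):+.1%}")
# variance (Table 7): -34.5%
# GSM8K accuracy (Table 5): +16.0%
# CLSClusteringS2S avg (Table 4): +10.5%
```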

Who Should Care

What To Try In 7 Days

Run a controlled fine-tune with DrICL on one model and task: enable differentiated loss with α≈0.2–0.4 and reweighting window W≈10.

Measure performance variance across k values (0,1,3,5,10,20,50) before/after to confirm stability gains.

Start with sampling size S=1 and γ≈11 to compute cumulative advantage; monitor training loss stability.
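The trial above can be tracked with a small config object and a before/after variance check. This is a sketch only: the class and function names are hypothetical, `alpha=0.3` is just the midpoint of the suggested range, and population variance is assumed because the summary does not say which variance estimator the paper uses.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class DrICLTrialConfig:
    """Hyperparameters for the 7-day trial (values taken from the list above)."""
    alpha: float = 0.3        # differentiated-loss weight, suggested 0.2 to 0.4
    window: int = 10          # reweighting window W
    sample_size: int = 1      # sampling size S
    gamma: float = 11.0       # gamma used for the cumulative advantage
    k_values: tuple = (0, 1, 3, 5, 10, 20, 50)


def variance_across_k(scores_by_k, k_values):
    """Population variance of a metric over the evaluated shot counts."""
    return pvariance([scores_by_k[k] for k in k_values])


def stability_report(before, after, config):
    """Compare metric variance across k before and after DrICL fine-tuning."""
    var_before = variance_across_k(before, config.k_values)
    var_after = variance_across_k(after, config.k_values)
    return {
        "variance_before": var_before,
        "variance_after": var_after,
        "more_stable": var_after < var_before,
    }
```

Here `before` and `after` are dictionaries mapping each k to the metric you measured at that shot count (for example accuracy), and `more_stable` is true when the fine-tuned model varies less across k than the starting checkpoint.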

Optimization Features

Training Optimization

  • Differentiated learning objective (trade-off many-shot vs zero-shot)
  • Advantage-based reweighting of training examples

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Robustness across dataset sizes is not fully analyzed; performance may vary with very small or very large task datasets (per the paper's Limitations section).
  • Uniform reweighting window may oversample short-text tasks or undersample long-text tasks; dynamic windowing not yet implemented.
  • Method requires many-shot training data and nontrivial compute (experiments used 8 A100 GPUs).

When Not To Use

  • When you cannot afford fine-tuning compute or have strictly zero-shot deployment requirements.
  • For tiny datasets where many-shot meta-train examples are unavailable.

Failure Modes

  • Poor hyperparameter choices (α, γ, W, S) can either undercut many-shot gains or cause weight explosion; paper reports best γ≈11 and S=1.
  • If many demonstrations are uniformly bad, advantage reweighting may not salvage performance.
  • Windowing mismatch: fixed window size can misrepresent tasks with very different sample lengths.
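The weight-explosion failure mode can be made concrete with a toy calculation. The exponential weight below is an assumed form (the summary does not give the paper's formula), and clipping is a common safeguard added here for illustration; it is not described in the source. All names and the clip bounds are hypothetical.

```python
import math


def advantage_weight(loss_gain, gamma):
    """ASSUMED weight form: exponential of the cumulative loss gain over gamma."""
    return math.exp(loss_gain / gamma)


def clipped_weight(weight, low=0.1, high=10.0):
    """Hypothetical safeguard: keep the weight inside [low, high]."""
    return max(low, min(high, weight))


# With a small gamma the same loss gain produces a far larger weight.
large_gamma = advantage_weight(20.0, gamma=11.0)   # roughly 6.2
small_gamma = advantage_weight(20.0, gamma=1.0)    # roughly 4.9e8
safe = clipped_weight(small_gamma)                 # capped at 10.0
```

Under this assumed form, monitoring the maximum weight per batch is a cheap way to catch a γ that is set too small before the training loss diverges.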

Core Entities

Models

  • Llama-2-7b-chat-hf
  • Mistral-7B-Instruct-v0.2

Metrics

  • Accuracy
  • ROUGE (R1)
  • BLEU (B1)
  • Distinct-3 (D3)
  • Precision@k
  • Recall@k
  • nDCG@k
  • performance variance

Datasets

  • ICL-50
  • CLSClusteringS2S
  • GSM8K
  • XSUM
  • CNN/DailyMail
  • OpenbookQA
  • ARC
  • cMedQA
  • TREC-COVID
  • EcomRetrieval
  • VideoRetrieval

Benchmarks

  • ICL-50