DrICL: tune objectives and reweight noisy demonstrations to stabilize many-shot in‑context learning

Overview

Decision Snapshot: Ready For Pilot

DrICL shows reproducible stability gains on multiple open models and 12 tested datasets; it needs fine-tuning compute (8 A100s used) and careful hyperparameter tuning (α, γ, W, S), so it's promising but not turnkey.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you push LLMs to use hundreds of examples, performance can fall; DrICL stabilizes many-shot behavior and reduces variability across tasks, so production systems that batch many examples (search reranking, large retrieval contexts, document clustering) get more predictable results.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager

Summary TLDR

Large language models can get worse when you feed them many in-context examples. The paper diagnoses two causes: (1) training with plain negative log-likelihood (NLL) does not favor many-shot over zero-shot, and (2) adding many demonstrations increases the number of noisy, high-loss examples. The authors propose DrICL: (a) a global 'differentiated learning' objective that forces many-shot loss below zero-shot loss, and (b) a local advantage-based reweighting that downweights noisy demonstrations using cumulative advantage (an RL-inspired reward). They release ICL-50, a 50-task many-shot dataset, and show that DrICL reduces performance variance and yields more stable or better accuracy across many-shot ranges.

Problem Statement

When you increase the number of in-context examples (k) into the hundreds, LLM performance often stops improving and can decline. Two practical drivers are: the standard NLL training objective does not optimize the trade-off between zero-shot and many-shot, and many-shot contexts accumulate noisy or harmful demonstrations that destabilize learning.

Main Contribution

DrICL: combines a global differentiated objective (encourage many-shot loss < zero-shot loss) with a local advantage-based reweighting of demonstrations.

An advantage-based reweighting algorithm that samples a preceding window, computes a cumulative advantage from loss differences, and multiplies many-shot NLL by that advantage.
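The two components above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hinge form of the global objective, the treatment of γ as a temperature inside an exponential, and all function names (`differentiated_loss`, `cumulative_advantage`, `reweighted_many_shot_nll`) are hypothetical choices made here. The summary only states that many-shot loss should fall below zero-shot loss and that the many-shot NLL is multiplied by a cumulative advantage computed from loss differences over a sampled preceding window.

```python
import math
import random
from typing import Optional, Sequence


def differentiated_loss(zero_shot_nll: float, many_shot_nll: float,
                        alpha: float = 0.3) -> float:
    """Global objective (ASSUMED hinge form, not the paper's formula).

    Adds a penalty scaled by alpha whenever the many-shot loss is not
    below the zero-shot loss; no penalty once many-shot is already lower.
    """
    gap = max(0.0, many_shot_nll - zero_shot_nll)
    return many_shot_nll + alpha * gap


def cumulative_advantage(prev_losses: Sequence[float], current_loss: float,
                         window: int = 10, samples: int = 1,
                         gamma: float = 11.0,
                         rng: Optional[random.Random] = None) -> float:
    """Local weight built from loss differences over a preceding window.

    ASSUMPTION: gamma acts as a temperature and the weight is
    exp(sum of loss differences / gamma).
    """
    if window <= 0 or samples <= 0:
        raise ValueError("window and samples must be positive")
    recent = list(prev_losses[-window:])
    if not recent:
        return 1.0  # nothing to compare against: neutral weight
    rng = rng or random.Random(0)
    picked = rng.sample(recent, min(samples, len(recent)))
    # Positive when the current demonstration has lower loss than the
    # sampled reference demonstrations, negative when it is noisier.
    total = sum(ref - current_loss for ref in picked)
    return math.exp(total / gamma)


def reweighted_many_shot_nll(prev_losses: Sequence[float],
                             current_loss: float, **kwargs) -> float:
    """Multiply the many-shot NLL by its cumulative advantage."""
    weight = cumulative_advantage(prev_losses, current_loss, **kwargs)
    return weight * current_loss
```

With this assumed form, a demonstration whose loss is higher than its sampled predecessors receives a weight below 1, so it contributes less to the update, while a cleaner demonstration receives a weight above 1.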

Key Findings

DrICL yields lower cross-dataset performance variance than baselines.

Numbers: variance avg DrICL=1.56e-03 vs MetaICL=2.38e-03 (Table 7)

Practical Use: Expect more stable accuracy across different k-shot settings; use DrICL when you need consistent behavior as you add many demo examples.

Evidence Ref: Table 7

DrICL improves reasoning accuracy on GSM8K versus baselines.

Numbers: GSM8K average accuracy DrICL=0.29 vs MetaICL=0.25 (Table 5)

Practical Use: If your task needs stepwise math reasoning, DrICL can raise accuracy modestly under many-shot fine-tuning.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
performance variance (across datasets)	1.56e-03 (DrICL average)	2.38e-03 (MetaICL average)	-34% relative reduction	12 evaluated datasets (Table 7)	Table 7 shows average variance for NFT, IT, MetaICL, DrICL	Table 7
Accuracy	avg 0.29 (DrICL)	avg 0.25 (MetaICL)	+0.04 absolute	GSM8K (Table 5)	Table 5 reports DrICL AVG=0.29 vs MetaICL=0.25	Table 5
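The Delta column can be checked with a few lines of arithmetic. The sketch below uses only the values reported in the table; the helper names `relative_change` and `absolute_change` are hypothetical.

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed change of value relative to baseline."""
    return (value - baseline) / baseline


def absolute_change(value: float, baseline: float) -> float:
    """Signed difference between value and baseline."""
    return value - baseline


# Average variance across datasets (Table 7): DrICL vs MetaICL.
variance_delta = relative_change(1.56e-03, 2.38e-03)
print(f"{variance_delta:.1%}")  # -34.5%

# GSM8K average accuracy (Table 5): DrICL vs MetaICL.
accuracy_delta = absolute_change(0.29, 0.25)
print(f"{accuracy_delta:+.2f}")  # +0.04
```

The variance figure works out to roughly a 34.5% relative reduction, which the table reports as -34%.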

What To Try In 7 Days

Run a controlled fine-tune with DrICL on one model and task: enable differentiated loss with α≈0.2–0.4 and reweighting window W≈10.

Measure performance variance across k values (0,1,3,5,10,20,50) before/after to confirm stability gains.

Start with sampling size S=1 and γ≈11 to compute cumulative advantage; monitor training loss stability.
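The measurement step above could be scripted roughly as follows. This is a sketch under assumptions: the config keys and function names are hypothetical, α=0.3 is simply one point inside the suggested 0.2–0.4 range, and population variance is used because the summary does not say whether the paper reports population or sample variance.

```python
from statistics import pvariance
from typing import Mapping, Sequence

# Hypothetical starting configuration taken from the steps above.
PILOT_CONFIG = {"alpha": 0.3, "window": 10, "samples": 1, "gamma": 11.0}

# Shot counts to evaluate before and after fine-tuning.
K_VALUES = (0, 1, 3, 5, 10, 20, 50)


def variance_across_k(acc_by_k: Mapping[int, float],
                      k_values: Sequence[int] = K_VALUES) -> float:
    """Population variance of accuracy over the chosen k-shot settings."""
    missing = [k for k in k_values if k not in acc_by_k]
    if missing:
        raise ValueError(f"missing accuracy for k={missing}")
    return pvariance([acc_by_k[k] for k in k_values])


def stability_gain(before: Mapping[int, float],
                   after: Mapping[int, float],
                   k_values: Sequence[int] = K_VALUES) -> float:
    """Positive when accuracy varies less across k after fine-tuning."""
    return variance_across_k(before, k_values) - variance_across_k(after, k_values)
```

Feed `stability_gain` the per-k accuracies measured on the same evaluation split before and after the DrICL fine-tune; a positive value indicates that accuracy became less sensitive to the number of demonstrations.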

Optimization Features

Training Optimization

Differentiated learning objective (trade-off many-shot vs zero-shot)

Advantage-based reweighting of training examples

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xiaoqzhwhu/DrICL

Data URLs

https://github.com/xiaoqzhwhu/DrICL

Risks & Boundaries

Limitations

Robustness across dataset sizes not fully analyzed—performance may vary with very small or very large task datasets (Limitations section).

Uniform reweighting window may oversample short-text tasks or undersample long-text tasks; dynamic windowing not yet implemented.

When Not To Use

When you cannot afford fine-tuning compute or have strictly zero-shot deployment requirements.

For tiny datasets where many-shot meta-train examples are unavailable.

Failure Modes

Poor hyperparameter choices (α, γ, W, S) can either undercut many-shot gains or cause weight explosion; paper reports best γ≈11 and S=1.

If many demonstrations are uniformly bad, advantage reweighting may not salvage performance.

Core Entities

Models

Llama-2-7b-chat-hf, Mistral-7B-Instruct-v0.2

Metrics

Accuracy, ROUGE (R1), BLEU (B1), Distinct-3 (D3), Precision@k, Recall@k, nDCG@k, performance variance

Datasets

ICL-50, CLS, Clustering, S2S, GSM8K, XSUM, CNN/DailyMail, OpenbookQA, ARC, cMedQA, TREC-COVID, EcomRetrieval, VideoRetrieval

Benchmarks

ICL-50

