Use graph diffusion to pick a small, high-impact set of unlabeled examples to label, and boost in‑context learning.

Overview

Decision Snapshot: Needs Validation

IDEAL is a practical selection method: it is unsupervised, easy to implement with embeddings, has provable greedy guarantees, and shows consistent empirical gains and large selection-time reductions versus prior baselines.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu


Why It Matters For Business

Label fewer examples and get nearly the same or better in-context performance while cutting selection time and inference cost; this lowers annotation bills and speeds up prompt curation.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

IDEAL is an unsupervised method to choose which unlabeled examples to annotate so that those labeled examples serve as strong in-context prompts for large language models. It builds a directed similarity graph from embeddings, measures a candidate subset's reach via a diffusion (influence) model, and greedily picks examples with the largest marginal influence. IDEAL matches or beats prior selective-annotation baselines on 9 datasets (17/18 cases) while using roughly 13% of the subset-selection time of the prior method (≈7.8× speedup). The paper includes a provable greedy approximation bound and shows Auto-IDEAL (automatic label propagation) can further expand prompts cheaply.

Problem Statement

In-context learning relies on annotated examples to use as prompts, but manual annotation is costly. How do we choose a small subset to label that yields good prompts for many test inputs while minimizing annotation and selection costs?

Main Contribution

An unsupervised, end-to-end selective annotation method (IDEAL) that picks unlabeled examples to annotate by maximizing a graph-based influence metric.

A practical algorithm: build a directed k-NN graph on Sentence‑BERT embeddings, quantify subset influence via an independent-cascade diffusion, and greedily select items by marginal gain.
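The select-by-marginal-influence loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the directed graph is already available as an adjacency dict, and the function names, the uniform activation probability `p`, and the Monte Carlo trial count are all hypothetical choices made for the example.

```python
import random


def simulate_cascade(graph, seeds, p, rng):
    """Run one independent-cascade diffusion and return the activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # A newly active node gets one chance to activate each
                # inactive out-neighbor (assumed uniform probability p).
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_influence(graph, seeds, p=0.3, trials=200, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)  # fixed seed so candidates are compared fairly
    total = sum(len(simulate_cascade(graph, seeds, p, rng)) for _ in range(trials))
    return total / trials


def greedy_select(graph, budget, p=0.3, trials=200):
    """Pick `budget` nodes, each time adding the one with the largest marginal gain."""
    selected = []
    for _ in range(min(budget, len(graph))):
        best_node, best_value = None, -1.0
        for cand in sorted(graph):
            if cand in selected:
                continue
            value = estimate_influence(graph, selected + [cand], p, trials)
            if value > best_value:
                best_node, best_value = cand, value
        selected.append(best_node)
    return selected


# Toy graph: node 0 points at three nodes, node 4 points at two.
toy_graph = {0: [1, 2, 3], 1: [], 2: [], 3: [], 4: [5, 6], 5: [], 6: []}
# With p=1.0 the cascade is deterministic, so one trial is enough.
picked = greedy_select(toy_graph, budget=2, p=1.0, trials=1)
print(picked)
```

Re-estimating influence for every candidate in every round is the simplest form of the greedy loop; it costs one batch of simulations per candidate per round, which is why a sketch like this is only practical on small graphs.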

Key Findings

IDEAL outperforms Votek and random selection in most evaluations.

Numbers: Better in 17 out of 18 eval cases across 9 datasets

Practical Use: Labeling the IDEAL-selected subset yields stronger in-context prompts than prior selection rules in practice; prefer IDEAL when you can compute embeddings.

Evidence Ref: Table 1; §4.2

Subset selection time is much lower than prior work (Votek).

Numbers: IDEAL uses ~13% of Votek's time (≈7.8× speedup)

Practical Use: Expect far lower compute/inference costs for selection because IDEAL is unsupervised and avoids generating predictions on large unlabeled pools.

Evidence Ref: Figure 3; §4.2
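The time fraction and the speedup are two roundings of the same ratio. A small sketch of the conversion, with hypothetical helper names, shows that the two reported figures agree to within rounding:

```python
def speedup_from_time_fraction(fraction):
    """Convert 'uses this fraction of the baseline time' into a speedup factor."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def time_fraction_from_speedup(speedup):
    """Convert a speedup factor back into a fraction of the baseline time."""
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return 1.0 / speedup


speedup = speedup_from_time_fraction(0.13)    # roughly 7.7x
fraction = time_fraction_from_speedup(7.8)    # roughly 12.8% of baseline time
print(f"{speedup:.2f}x, {fraction:.1%}")
```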

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	IDEAL 66.4%	Votek 64.6%	+1.8 pp	MRPC	Table 1 (budget=100)	Table 1
Accuracy	IDEAL 51.4%	Votek 46.6%	+4.8 pp	SST-5	Table 1 (budget=100)	Table 1

What To Try In 7 Days

Compute Sentence-BERT embeddings for 3k unlabeled points and build a directed k-NN graph (k=10).

Run IDEAL's greedy influence selection to pick m examples, label them, then use similarity-based retrieval as prompts.

Compare prompt accuracy and selection compute against random selection and your current pipeline; measure selection time and token costs.
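The graph-building and retrieval steps above can be sketched with NumPy, assuming the embeddings have already been computed and are available as a 2-D array. The function names, the cosine-similarity choice, and the edge direction (each node points at its k nearest neighbors) are assumptions made for illustration; check them against the released code before relying on them.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def build_knn_graph(embeddings, k=10):
    """Return {node: [k most similar other nodes]} as a directed graph."""
    unit = normalize_rows(np.asarray(embeddings, dtype=float))
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)        # never pick a node as its own neighbor
    k = min(k, len(unit) - 1)
    neighbors = np.argsort(-sims, axis=1)[:, :k]
    return {i: [int(j) for j in row] for i, row in enumerate(neighbors)}


def retrieve_prompts(query_embedding, labeled_embeddings, top_n=3):
    """Return indices of the labeled examples most similar to the query."""
    unit = normalize_rows(np.asarray(labeled_embeddings, dtype=float))
    q = np.asarray(query_embedding, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = unit @ q
    return [int(i) for i in np.argsort(-scores)[:top_n]]


# Four toy 2-D "embeddings" forming two obvious pairs.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_knn = build_knn_graph(toy_embeddings, k=1)
toy_prompts = retrieve_prompts([1.0, 0.0], toy_embeddings, top_n=2)
print(toy_knn, toy_prompts)
```

For a pool of about 3k points the dense similarity matrix is small enough to hold in memory; a much larger pool would likely need an approximate nearest-neighbor index instead.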

Optimization Features

Token Efficiency

Lower token usage during selection since no model completions are required for unlabeled points

Inference Optimization

Reduces selection-stage inference calls by avoiding LLM predictions over the unlabeled pool; reporte

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://skzhang1.github.io/IDEAL/

Risks & Boundaries

Limitations

Requires good embeddings: poor embedding quality harms the graph and selection.

Memory for LLM inference still large: loading a 6B model needs ≈23GB GPU memory.
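One plausible reading of the ≈23GB figure is the size of the weights alone in 32-bit precision. The arithmetic below is a rough sketch under that assumption; it ignores activations, cache, and framework overhead, and the helper name is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


fp32_gib = weight_memory_gib(6e9, 4)   # 32-bit weights (assumed precision)
fp16_gib = weight_memory_gib(6e9, 2)   # 16-bit weights, for comparison
print(f"fp32: {fp32_gib:.1f} GiB, fp16: {fp16_gib:.1f} GiB")
```

If that reading is right, loading the same model in half precision would roughly halve the weight footprint.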

When Not To Use

When you lack reliable sentence embeddings for your domain.

When you need expanded labels but cannot afford the model predictions that Auto-IDEAL's automatic annotation requires.

Failure Modes

Embedding bias selects semantically similar but label-skewed examples, reducing downstream accuracy on some classes.

Graph connectivity issues (isolated nodes) limit diffusion, causing poor influence estimates.
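A quick connectivity check can flag the second failure mode before selection runs. In a directed k-NN graph every node has outgoing edges, so this sketch looks for nodes with no incoming edges, which diffusion can start from but never reach. The adjacency-dict format and the function name are assumptions for illustration.

```python
def find_unreached_nodes(graph):
    """Return nodes that no other node points to (zero in-degree)."""
    has_incoming = set()
    for src, targets in graph.items():
        for dst in targets:
            if dst != src:               # ignore self-loops
                has_incoming.add(dst)
    return sorted(node for node in graph if node not in has_incoming)


# Node 2 points at node 0, but nothing points back at node 2.
example_graph = {0: [1], 1: [0], 2: [0]}
unreached = find_unreached_nodes(example_graph)
print(unreached)
```

A large share of such nodes suggests trying a larger k or different embeddings before trusting the influence estimates.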

Core Entities

Models

GPT-J 6B, GPT-Neo 2.7B, GPT-3.5-Turbo, Text-davinci-002

Metrics

Accuracy, ROUGE-L

Datasets

MRPC, SST-5, MNLI, DBpedia, RTE, HellaSwag, MWoZ, GeoQuery, Xsum, SST-2, BoolQ, IMDb, BoolQ Contrast Set

