Use graph diffusion to pick a small, high-impact set of unlabeled examples to label, and boost in-context learning.

October 16, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

IDEAL is a practical selection method: it is unsupervised, easy to implement with embeddings, has provable greedy guarantees, and shows consistent empirical gains and large selection-time reductions versus prior baselines.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu

Links

Abstract / PDF / Code

Why It Matters For Business

Label fewer examples and get nearly the same or better in-context performance while cutting selection time and inference cost; this lowers annotation bills and speeds up prompt curation.

Summary TLDR

IDEAL is an unsupervised method to choose which unlabeled examples to annotate so that those labeled examples serve as strong in-context prompts for large language models. It builds a directed similarity graph from embeddings, measures a candidate subset's reach via a diffusion (influence) model, and greedily picks examples with the largest marginal influence. IDEAL matches or beats prior selective-annotation baselines on 9 datasets (17/18 cases) while using roughly 13% of the subset-selection time of the prior method (≈7.8× speedup). The paper includes a provable greedy approximation bound and shows Auto-IDEAL (automatic label propagation) can further expand prompts cheaply.

Problem Statement

In-context learning needs many annotated prompts but manual annotation is costly. How do we choose a small subset to label that gives good prompts for many test inputs while minimizing annotation and selection costs?

Main Contribution

An unsupervised, end-to-end selective annotation method (IDEAL) that picks unlabeled examples to annotate by maximizing a graph-based influence metric.

A practical algorithm: build a directed k-NN graph on Sentence‑BERT embeddings, quantify subset influence via an independent-cascade diffusion, and greedily select items by marginal gain.
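The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the graph format (node -> list of `(successor, activation_probability)` pairs), the function names, the number of simulation runs, and the use of a fixed random seed are all assumptions made for this example.

```python
import random


def simulate_cascade(graph, seeds, rng):
    """Run one independent-cascade diffusion; return the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in graph.get(v, []):
                # A newly activated node gets one chance to activate each
                # inactive successor, succeeding with probability p.
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active


def estimate_influence(graph, seeds, n_runs=20, seed=0):
    """Average number of activated nodes over n_runs simulated cascades."""
    rng = random.Random(seed)  # fixed seed keeps candidate comparisons repeatable
    total = sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(n_runs))
    return total / n_runs


def greedy_select(graph, nodes, budget, n_runs=20, seed=0):
    """Greedily add the node with the largest estimated marginal influence."""
    selected = []
    current = 0.0  # estimated influence of the nodes selected so far
    for _ in range(budget):
        best_node, best_gain = None, float("-inf")
        for v in nodes:
            if v in selected:
                continue
            gain = estimate_influence(graph, selected + [v], n_runs, seed) - current
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:
            break
        selected.append(best_node)
        current += best_gain
    return selected


# Toy graph with deterministic edges (probability 1.0) so the result is easy to check.
toy_graph = {0: [(1, 1.0), (2, 1.0)], 1: [], 2: [], 3: [(4, 1.0)], 4: []}
print(greedy_select(toy_graph, list(toy_graph), budget=2))  # [0, 3]
```

In the toy graph, node 0 reaches three nodes and node 3 adds two more, so the greedy loop picks them in that order. Each greedy step re-simulates every remaining candidate, so on a real pool the cost grows with pool size, budget, and `n_runs`.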

Key Findings

IDEAL outperforms Votek and random selection in most evaluations.

Numbers: Better in 17 out of 18 eval cases across 9 datasets

Practical Use: Labeling the IDEAL-selected subset yields stronger in-context prompts than prior selection rules in practice; prefer IDEAL when you can compute embeddings.

Evidence Ref: Table 1; §4.2

Subset selection time is much lower than prior work (Votek).

Numbers: IDEAL uses ~13% of Votek's time (≈7.8× speedup)

Practical Use: Expect far lower compute/inference costs for selection because IDEAL is unsupervised and avoids generating predictions on large unlabeled pools.

Evidence Ref: Figure 3; §4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | IDEAL 66.4% | Votek 64.6% | +1.8 pp | MRPC | Table 1 (budget=100) | Table 1
Accuracy | IDEAL 51.4% | Votek 46.6% | +4.8 pp | SST-5 | Table 1 (budget=100) | Table 1

What To Try In 7 Days

Compute Sentence-BERT embeddings for 3k unlabeled points and build a directed k-NN graph (k=10).

Run IDEAL's greedy influence selection to pick m examples, label them, then use similarity-based retrieval as prompts.

Compare prompt accuracy and selection compute against random selection and your current pipeline; measure selection time and token costs.
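The graph-building and retrieval steps in this plan might look like the sketch below. It assumes the embeddings are already available as a NumPy array (for example, produced by a Sentence-BERT model in a separate step). The function names are hypothetical, and turning cosine similarities into edge weights by normalising over each node's neighbours is an assumption of this example, not a detail confirmed by the summary.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def build_knn_graph(embeddings, k=10):
    """Directed k-NN graph: node -> list of (neighbor, weight) pairs."""
    unit = normalize_rows(np.asarray(embeddings, dtype=float))
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-edges
    k = min(k, len(unit) - 1)
    graph = {}
    for v in range(len(unit)):
        nbrs = np.argsort(-sims[v])[:k]  # k most similar other points
        w = np.clip(sims[v, nbrs], 0.0, None)
        total = w.sum()
        # Assumption: weights are similarities normalised over the k neighbours.
        probs = w / total if total > 0 else np.zeros_like(w)
        graph[v] = [(int(u), float(p)) for u, p in zip(nbrs, probs)]
    return graph


def retrieve_prompts(query_emb, labeled_embs, n_prompts=5):
    """Indices of the labeled examples most similar to the query, best first."""
    unit = normalize_rows(np.asarray(labeled_embs, dtype=float))
    q = np.asarray(query_emb, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = unit @ q
    return [int(i) for i in np.argsort(-scores)[:n_prompts]]


# Random vectors stand in for real sentence embeddings in this demo.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 32))
graph = build_knn_graph(pool, k=10)
prompt_ids = retrieve_prompts(pool[0], pool[:50], n_prompts=5)
```

For a pool of about 3k points the dense similarity matrix fits comfortably in memory; larger pools would likely need an approximate nearest-neighbour index instead. To compare selection cost against a baseline, wrapping each selection call in `time.perf_counter()` is usually enough.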

Optimization Features

Token Efficiency
Lower token usage during selection since no model completions are required for unlabeled points
Inference Optimization

Reduces selection-stage inference calls by avoiding LLM predictions over the unlabeled pool; reporte

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires good embeddings: poor embedding quality harms the graph and selection.

LLM inference still needs substantial memory: loading a 6B model requires ≈23GB of GPU memory.

When Not To Use

When you lack reliable sentence embeddings for your domain.

When you need expanded labels but cannot afford any model predictions for automatic annotation (Auto-IDEAL).

Failure Modes

Embedding bias selects semantically similar but label-skewed examples, reducing downstream accuracy on some classes.

Graph connectivity issues (isolated nodes) limit diffusion, causing poor influence estimates.
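A quick connectivity check before running selection can surface the second failure mode. The sketch below assumes a hypothetical graph format (node -> list of `(successor, weight)` pairs). In a directed k-NN graph every node has outgoing edges, so the nodes worth flagging are mainly those with no incoming edges, which diffusion from other nodes can never reach.

```python
from collections import Counter


def connectivity_report(graph):
    """List nodes that diffusion cannot reach or cannot leave."""
    in_degree = Counter()
    for v, edges in graph.items():
        for u, _weight in edges:
            in_degree[u] += 1
    return {
        # No incoming edges: only activated if selected directly.
        "no_incoming": sorted(v for v in graph if in_degree[v] == 0),
        # No outgoing edges: selecting them influences nothing else.
        "no_outgoing": sorted(v for v, edges in graph.items() if not edges),
    }


toy_graph = {0: [(1, 0.5)], 1: [(0, 0.5)], 2: [(0, 1.0)], 3: []}
report = connectivity_report(toy_graph)
print(report)  # {'no_incoming': [2, 3], 'no_outgoing': [3]}
```

A large `no_incoming` list suggests that k is too small or that the embeddings cluster poorly for the domain.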

Core Entities

Models

GPT-J 6B, GPT-Neo 2.7B, GPT-3.5-Turbo, Text-davinci-002

Metrics

Accuracy, ROUGE-L

Datasets

MRPC, SST-5, MNLI, DBpedia, RTE, HellaSwag, MWoZ, GeoQuery, Xsum, SST-2, BoolQ, IMDb, BoolQ Contrast Set