Use graph diffusion to pick a small, high-impact set of unlabeled examples to label, and boost in‑context learning.

October 16, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu

Links

Abstract / PDF

Why It Matters For Business

Label fewer examples and get nearly the same or better in-context performance while cutting selection time and inference cost; this lowers annotation bills and speeds up prompt curation.

Summary TLDR

IDEAL is an unsupervised method to choose which unlabeled examples to annotate so that those labeled examples serve as strong in-context prompts for large language models. It builds a directed similarity graph from embeddings, measures a candidate subset's reach via a diffusion (influence) model, and greedily picks examples with the largest marginal influence. IDEAL matches or beats prior selective-annotation baselines on 9 datasets (17/18 cases) while using roughly 13% of the subset-selection time of the prior method (≈7.8× speedup). The paper includes a provable greedy approximation bound and shows Auto-IDEAL (automatic label propagation) can further expand prompts cheaply.
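The diffusion-then-greedy idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the graph format (`node -> [(neighbor, probability)]`), the edge probabilities, the number of runs, and the function names `simulate_cascade`, `estimate_influence`, and `greedy_select` are all assumptions made for this example.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """One independent-cascade run; returns the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in graph.get(v, []):
                # A newly active node gets one chance to activate each inactive neighbor.
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active

def estimate_influence(graph, seeds, runs=20, seed=0):
    """Average number of activated nodes over repeated stochastic runs."""
    rng = random.Random(seed)  # fixed seed so candidate subsets are compared reproducibly
    total = sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(runs))
    return total / runs

def greedy_select(graph, nodes, budget, runs=20):
    """Repeatedly add the node whose inclusion gives the largest estimated influence."""
    selected, current = [], 0.0
    for _ in range(budget):
        best_node, best_value = None, current
        for v in nodes:
            if v in selected:
                continue
            value = estimate_influence(graph, selected + [v], runs)
            if best_node is None or value > best_value:
                best_node, best_value = v, value
        selected.append(best_node)
        current = best_value
    return selected, current

# Hypothetical toy graph: two loose clusters plus one node nobody points to.
toy_graph = {
    "a": [("b", 0.9), ("c", 0.5)],
    "b": [("c", 0.7)],
    "c": [],
    "d": [("e", 0.8)],
    "e": [],
    "f": [],
}
picked, score = greedy_select(toy_graph, list(toy_graph), budget=2)
print(picked, round(score, 2))
```

Each greedy step re-estimates influence for every remaining candidate, so the cost grows with pool size times budget times the number of runs; a production version would likely cache or vectorize these estimates.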

Problem Statement

In-context learning relies on annotated examples to use as prompts, but manual annotation is costly. How can we choose a small subset to label that yields good prompts for many test inputs while keeping annotation and selection costs low?

Main Contribution

An unsupervised, end-to-end selective annotation method (IDEAL) that picks unlabeled examples to annotate by maximizing a graph-based influence metric.

A practical algorithm: build a directed k-NN graph on Sentence‑BERT embeddings, quantify subset influence via an independent-cascade diffusion, and greedily select items by marginal gain.

Theoretical guarantee: the greedy selection attains a provable lower-bound fraction of the optimal influence (approaches 1 - 1/e as budget grows).

Empirical wins across 9 datasets and multiple LLMs: better prompt quality and ~7.8× faster selection time than the prior SOTA (Votek).
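For orientation, the classical greedy guarantee for a monotone submodular set function has the form below. This is the textbook result, shown as an assumption about the shape of the bound; the paper's exact statement and conditions may differ. Here $f$ is the influence function, $m$ the annotation budget, $S_m$ the greedy subset, and $S_m^{*}$ an optimal subset of size $m$ (all notation introduced for this note).

```latex
f(S_m) \;\ge\; \left(1 - \left(1 - \frac{1}{m}\right)^{m}\right) f(S_m^{*})
       \;\ge\; \left(1 - \frac{1}{e}\right) f(S_m^{*})
```

The factor $1 - (1 - 1/m)^{m}$ equals 1 at $m = 1$ and decreases toward $1 - 1/e \approx 0.632$ as $m$ grows, which matches the limit quoted above.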

Key Findings

IDEAL outperforms Votek and random selection in most evaluations.

Numbers: Better in 17 out of 18 eval cases across 9 datasets

Subset selection time is much lower than prior work (Votek).

Numbers: IDEAL uses ~13% of Votek's time (≈7.8× speedup)

Influence correlates with in-context performance.

Numbers: Higher-influence subsets show better average, median, and worst-case performance in sampled tests

Auto-IDEAL (automatic label diffusion) can enlarge the final prompt set cheaply.

Numbers: Auto-IDEAL improves over IDEAL in 4 of 5 classification cases (Table 5)

Results

Accuracy

Value: IDEAL 66.4%

Baseline: Votek 64.6%

Accuracy

Value: IDEAL 51.4%

Baseline: Votek 46.6%

Selection time

Value: IDEAL ≈13% of Votek

Baseline: Votek 100%

Auto-IDEAL vs IDEAL (classification avg.)

Value: Auto-IDEAL often higher

Baseline: IDEAL

Who Should Care

What To Try In 7 Days

Compute Sentence-BERT embeddings for 3k unlabeled points and build a directed k-NN graph (k=10).

Run IDEAL's greedy influence selection to pick m examples, label them, then use similarity-based retrieval as prompts.

Compare prompt accuracy and selection compute against random selection and your current pipeline; measure selection time and token costs.
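The graph-building and retrieval steps above can be prototyped without any model in the loop. The sketch below is a rough starting point under stated assumptions: embeddings are plain lists of floats standing in for Sentence-BERT vectors, cosine similarity is the assumed similarity measure, and `cosine`, `build_knn_graph`, and `retrieve_prompts` are hypothetical helper names, not part of any released IDEAL code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_knn_graph(embeddings, k):
    """Directed k-NN graph: node i points to its k most similar other nodes."""
    graph = {}
    for i, e in enumerate(embeddings):
        sims = [(cosine(e, other), j) for j, other in enumerate(embeddings) if j != i]
        # Highest similarity first; ties broken by the smaller index.
        sims.sort(key=lambda t: (-t[0], t[1]))
        graph[i] = [j for _, j in sims[:k]]
    return graph

def retrieve_prompts(query, labeled, top_n):
    """Return the top_n labeled examples most similar to the query embedding."""
    ranked = sorted(labeled, key=lambda item: -cosine(query, item["embedding"]))
    return ranked[:top_n]

# Toy stand-ins for real sentence embeddings (assumed 2-D for readability).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_knn_graph(pool, k=1)

labeled = [
    {"text": "example A", "label": "pos", "embedding": pool[0]},
    {"text": "example B", "label": "neg", "embedding": pool[2]},
]
prompts = retrieve_prompts([0.8, 0.2], labeled, top_n=1)
print(graph, [p["text"] for p in prompts])
```

The pairwise loop is quadratic in pool size, so for a few thousand points a vectorized or approximate nearest-neighbor library would be the more practical choice. The graph here stores neighbors only; any edge weights used for diffusion would need to be added separately.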

Optimization Features

Token Efficiency

  • Lower token usage during selection since no model completions are required for unlabeled points

Inference Optimization

  • Reduces selection-stage inference calls by avoiding LLM predictions over the unlabeled pool; reporte

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires good embeddings: poor embedding quality harms the graph and selection.
  • Memory for LLM inference still large: loading a 6B model needs ≈23GB GPU memory.
  • Auto-IDEAL needs extra predictions across unlabeled data and can be costly.
  • The diffusion step is stochastic; influence is averaged over repeated runs to stabilize estimates.

When Not To Use

  • When you lack reliable sentence embeddings for your domain.
  • When you cannot afford any predictions for Auto-annotation but require expanded labels.
  • For extremely small unlabeled pools where random sampling already approximates distribution.

Failure Modes

  • Embedding bias selects semantically similar but label-skewed examples, reducing downstream accuracy on some classes.
  • Graph connectivity issues (isolated nodes) limit diffusion, causing poor influence estimates.
  • Out-of-domain test distributions with no shared structure may reduce transfer from the selected subset.
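One cheap diagnostic for the connectivity failure mode is to count nodes that no other node points to, since diffusion along directed edges can never reach them unless they are selected directly. The sketch below assumes the graph is a dict mapping each node to a list of out-neighbors; `reachability_report` is a hypothetical helper written for this note, not something described in the paper.

```python
def reachability_report(graph):
    """Summarize in-degrees of a directed graph given as node -> list of out-neighbors."""
    in_degree = {v: 0 for v in graph}
    for v, neighbors in graph.items():
        for u in neighbors:
            in_degree[u] = in_degree.get(u, 0) + 1
    # Nodes with in-degree 0 cannot be activated by any other node.
    unreachable = sorted(v for v, d in in_degree.items() if d == 0)
    return {
        "unreachable": unreachable,
        "max_in_degree": max(in_degree.values(), default=0),
    }

# Hypothetical example: node 2 points into the graph but nothing points back to it.
report = reachability_report({0: [1], 1: [0], 2: [0]})
print(report)
```

A large share of unreachable nodes suggests that k is too small or that the embeddings form disconnected clusters, both of which are worth checking before trusting the influence estimates.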

Core Entities

Models

  • GPT-J 6B
  • GPT-Neo 2.7B
  • GPT-3.5-Turbo
  • Text-davinci-002

Metrics

  • Accuracy
  • ROUGE-L

Datasets

  • MRPC
  • SST-5
  • MNLI
  • DBpedia
  • RTE
  • HellaSwag
  • MWoZ
  • GeoQuery
  • Xsum
  • SST-2
  • BoolQ
  • IMDb
  • BoolQ Contrast Set