Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Label fewer examples and get nearly the same or better in-context performance while cutting selection time and inference cost; this lowers annotation bills and speeds up prompt curation.
Summary TLDR
IDEAL is an unsupervised method to choose which unlabeled examples to annotate so that those labeled examples serve as strong in-context prompts for large language models. It builds a directed similarity graph from embeddings, measures a candidate subset's reach via a diffusion (influence) model, and greedily picks examples with the largest marginal influence. IDEAL matches or beats prior selective-annotation baselines on 9 datasets (17/18 cases) while using roughly 13% of the subset-selection time of the prior method (≈7.8× speedup). The paper includes a provable greedy approximation bound and shows Auto-IDEAL (automatic label propagation) can further expand prompts cheaply.
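The selection loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the graph format (a dict mapping each node to `(successor, activation_probability)` pairs), the function names, and the default number of simulation runs are all assumptions made for the example.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """One independent-cascade run; returns the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in graph.get(v, []):
                # A newly active node gets one chance to activate each successor.
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active

def estimate_influence(graph, seeds, runs=50, seed=0):
    """Average cascade size over `runs` simulations (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(runs))
    return total / runs

def greedy_select(graph, nodes, budget, runs=50):
    """Repeatedly add the node whose inclusion yields the largest estimated influence."""
    selected = []
    for _ in range(budget):
        best_node, best_value = None, -1.0
        for v in nodes:
            if v in selected:
                continue
            value = estimate_influence(graph, selected + [v], runs)
            if value > best_value:
                best_node, best_value = v, value
        selected.append(best_node)
    return selected

# Toy graph with deterministic edges so the result is easy to check by hand.
toy_graph = {0: [(1, 1.0), (2, 1.0)], 3: [(4, 1.0)]}
picked = greedy_select(toy_graph, nodes=range(6), budget=2)
print(picked)  # [0, 3] here, because every edge fires with probability 1
```

Node 0 is chosen first because it reaches three nodes; node 3 is chosen next because it adds two nodes that node 0 cannot reach. Picking the candidate with the largest total influence is equivalent to picking the largest marginal gain, since the influence of the current subset is the same for every candidate.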
Problem Statement
In-context learning needs many annotated prompts but manual annotation is costly. How do we choose a small subset to label that gives good prompts for many test inputs while minimizing annotation and selection costs?
Main Contribution
An unsupervised, end-to-end selective annotation method (IDEAL) that picks unlabeled examples to annotate by maximizing a graph-based influence metric.
A practical algorithm: build a directed k-NN graph on Sentence‑BERT embeddings, quantify subset influence via an independent-cascade diffusion, and greedily select items by marginal gain.
Theoretical guarantee: the greedy selection attains a provable lower-bound fraction of the optimal influence (approaches 1 - 1/e as budget grows).
Empirical wins across 9 datasets and multiple LLMs: better prompt quality and ~7.8× faster subset selection than the prior SOTA (Vote-k).
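To give a feel for the "approaches 1 - 1/e" statement, the snippet below evaluates the classical greedy guarantee for monotone submodular maximization, 1 - (1 - 1/m)^m for budget m. Treating the paper's bound as having this classical form is an assumption; the summary only states the limiting value, and the paper's exact expression may differ.

```python
import math

def greedy_bound(m):
    """Classical greedy guarantee for budget m (assumed form, see lead-in)."""
    return 1.0 - (1.0 - 1.0 / m) ** m

limit = 1.0 - 1.0 / math.e  # about 0.632

for m in (1, 5, 20, 100):
    # The guaranteed fraction decreases toward the limit as the budget grows.
    print(m, round(greedy_bound(m), 4), round(greedy_bound(m) - limit, 4))
```

Under this form the guarantee is a worst-case floor: it never drops below about 63% of the optimal influence, whatever the budget.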
Key Findings
IDEAL outperforms Vote-k and random selection in most evaluations.
Subset selection takes roughly 13% of the time Vote-k needs (≈7.8× faster).
A subset's influence score correlates with its in-context performance.
Auto-IDEAL (automatic label diffusion) can enlarge the final prompt set cheaply.
Results
Accuracy
Selection time
Auto-IDEAL vs IDEAL (classification avg.)
Who Should Care
What To Try In 7 Days
Compute Sentence-BERT embeddings for 3k unlabeled points and build a directed k-NN graph (k=10).
Run IDEAL's greedy influence selection to pick m examples, label them, then retrieve the most similar labeled examples as prompts for each test input.
Compare prompt accuracy and selection compute against random selection and your current pipeline; measure selection time and token costs.
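The graph-building and retrieval steps above can be prototyped without any model in the loop. The sketch below is only an illustration under stated assumptions: the tiny 2-D vectors stand in for real Sentence-BERT embeddings, the function names are hypothetical, edges point from each node to its nearest neighbors (the paper's edge direction and weighting may differ), and all vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_graph(embeddings, k):
    """Directed k-NN graph: node i -> list of (neighbor, similarity), best first."""
    graph = {}
    for i, emb in enumerate(embeddings):
        sims = [(cosine(emb, other), j)
                for j, other in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)
        graph[i] = [(j, s) for s, j in sims[:k]]
    return graph

def retrieve_prompts(query_emb, labeled, n):
    """Return the n labeled (text, label) pairs most similar to the query."""
    ranked = sorted(labeled, key=lambda ex: cosine(query_emb, ex[0]), reverse=True)
    return [(text, label) for _, text, label in ranked[:n]]

# Placeholder embeddings; in practice these would come from a sentence encoder.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(embeddings, k=1)

labeled = [([1.0, 0.0], "great movie", "positive"),
           ([0.0, 1.0], "dull plot", "negative")]
prompts = retrieve_prompts([0.8, 0.2], labeled, n=1)
print(graph, prompts)
```

The all-pairs loop is quadratic in the pool size, so for a few thousand points a vectorized or approximate nearest-neighbor search is likely the more practical choice; the pure-Python version is kept here only for readability.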
Optimization Features
Token Efficiency
- Lower token usage during selection since no model completions are required for unlabeled points
Inference Optimization
- Reduces selection-stage inference calls by avoiding LLM predictions over the unlabeled pool; reporte
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires good embeddings: poor embedding quality harms the graph and selection.
- LLM inference memory remains large: loading a 6B model needs ≈23GB of GPU memory.
- Auto-IDEAL needs extra predictions across unlabeled data and can be costly.
- The diffusion step is stochastic; influence is averaged over repeated runs to stabilize estimates.
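The effect of averaging repeated diffusion runs can be seen on a toy example. This is a sketch under assumptions, not the paper's procedure: the star graph, the 0.5 activation probabilities, the run counts, and the helper names are all chosen for illustration.

```python
import random
import statistics

def one_run(graph, seeds, rng):
    """Size of one independent-cascade run started from `seeds`."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        v = frontier.pop()
        for u, p in graph.get(v, []):
            if u not in active and rng.random() < p:
                active.add(u)
                frontier.append(u)
    return len(active)

def mean_influence(graph, seeds, runs, rng):
    """Influence estimate: cascade size averaged over `runs` simulations."""
    return sum(one_run(graph, seeds, rng) for _ in range(runs)) / runs

# Seed node 0 activates each of four neighbors with probability 0.5,
# so the true expected influence is 1 + 4 * 0.5 = 3.
star = {0: [(1, 0.5), (2, 0.5), (3, 0.5), (4, 0.5)]}
rng = random.Random(0)

few = [mean_influence(star, [0], 4, rng) for _ in range(30)]
many = [mean_influence(star, [0], 400, rng) for _ in range(30)]

# Spread of the estimate across repetitions, for 4 vs 400 runs per estimate.
print(statistics.stdev(few), statistics.stdev(many))
```

Estimates built from more runs cluster more tightly around the expected value, which matters because the greedy step compares candidates by small differences in estimated influence.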
When Not To Use
- When you lack reliable sentence embeddings for your domain.
- When you need an expanded label set but cannot afford the extra model predictions that Auto-IDEAL requires.
- For extremely small unlabeled pools where random sampling already approximates the data distribution.
Failure Modes
- Embedding bias selects semantically similar but label-skewed examples, reducing downstream accuracy on some classes.
- Graph connectivity issues (isolated nodes) limit diffusion, causing poor influence estimates.
- Out-of-domain test distributions with no shared structure may reduce transfer from the selected subset.
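The connectivity failure mode above can be checked before running selection. The sketch below assumes a directed graph stored as a dict of successor lists and interprets "isolated" as "no incoming edge"; both the representation and the helper name are assumptions for illustration.

```python
from collections import Counter

def unreachable_nodes(graph):
    """Nodes with no incoming edge: diffusion can never reach them unless seeded."""
    in_degree = Counter(u for successors in graph.values() for u in successors)
    return sorted(v for v in graph if in_degree[v] == 0)

# In a directed k-NN graph every node has k outgoing edges,
# but some nodes may be nobody's nearest neighbor.
toy = {0: [1], 1: [0], 2: [0], 3: [1]}
print(unreachable_nodes(toy))  # [2, 3]
```

A large share of such nodes suggests that k is too small or that the embeddings cluster poorly, in which case influence estimates will understate how much of the pool a subset actually covers.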
Core Entities
Models
- GPT-J 6B
- GPT-Neo 2.7B
- GPT-3.5-Turbo
- Text-davinci-002
Metrics
- Accuracy
- ROUGE-L
Datasets
- MRPC
- SST-5
- MNLI
- DBpedia
- RTE
- HellaSwag
- MWoZ
- GeoQuery
- Xsum
- SST-2
- BoolQ
- IMDb
- BoolQ Contrast Set

