Overview
Training a prompt retriever with a small frozen LLM is a practical, lower-cost way to improve the zero-shot performance of many larger LLMs. Validate per task, because some tasks (coreference, certain commonsense tasks) may get worse.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Train a single lightweight retriever once (with a small model) to boost many larger LLMs at inference, cutting repeated fine-tuning costs and improving zero-shot accuracy on many NLU tasks.
Summary TLDR
UPRISE trains a lightweight bi-encoder retriever (initialized from BERT-base) using a small frozen LLM (GPT‑Neo‑2.7B) as a labeler. The retriever picks natural-language demonstrations from a pool and prepends the top-K (K=3) to the test input. This improves zero-shot performance on many NLU clusters (notably Reading Comprehension and Paraphrase Detection), transfers from the small tuning model to much larger LLMs (BLOOM, OPT, GPT‑3 family), and reduces hallucination on fact-checking for ChatGPT in the experiments.
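The retrieve-and-prepend step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes inner-product similarity between query and prompt embeddings, uses hand-written toy vectors in place of the trained encoders, and joins demonstrations with blank lines in descending score order. The helper names `retrieve_top_k` and `build_input` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, assumed here as the bi-encoder similarity score."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    pool: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the k pool prompts whose embeddings score highest against the query."""
    ranked = sorted(pool, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_input(demos: List[str], test_input: str) -> str:
    """Prepend the retrieved demonstrations to the test input (separator is an assumption)."""
    return "\n\n".join(demos + [test_input])


# Toy, hand-written embeddings; a real system would compute them with the trained encoders.
pool = [
    ("Q: Is the sky blue? A: yes", [1.0, 0.0]),
    ("Q: Capital of France? A: Paris", [0.0, 1.0]),
    ("Q: Is water wet? A: yes", [0.9, 0.1]),
    ("Q: 2 + 2? A: 4", [0.2, 0.8]),
]
demos = retrieve_top_k([1.0, 0.0], pool, k=3)
prompt = build_input(demos, "Q: Is grass green? A:")
```

Because the retriever only produces text that is prepended to the input, the same `prompt` string can be sent to any LLM at inference time, which is what makes the cross-model transfer possible.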
Problem Statement
Designing prompts or fine-tuning each new LLM is costly and brittle. Can we train a single, lightweight prompt retriever on diverse tasks (using a small frozen LLM to score prompts) that generalizes to unseen task types and to much larger LLMs at inference time?
Main Contribution
Introduce UPRISE: a prompt-retrieval pipeline that trains a bi-encoder retriever with a frozen LLM as a labeler.
Demonstrate cross-task and cross-model generalization: a retriever tuned with GPT‑Neo‑2.7B transfers to BLOOM-7.1B, OPT-66B, and GPT-3 (Davinci) without extra tuning.
Key Findings
UPRISE raises the zero-shot Reading Comprehension average on GPT‑Neo‑2.7B from 31.6 to 40.1, an absolute gain of 8.5 points.
Paraphrase Detection sees large gains: for example, MRPC accuracy rises from 46.6 to 67.9 (+21.3 points) with GPT‑Neo‑2.7B.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Reading Comprehension average (GPT-Neo-2.7B) | 40.1 | 31.6 (0-SHOT) | +8.5 pp | Reading Comprehension cluster (avg over SQuADv1, BoolQ, MultiRC, OBQA) | Table 1 shows 0-SHOT 31.6 -> UPRISE 40.1 | Table 1 |
| Accuracy | 67.9 | 46.6 (0-SHOT) | +21.3 pp | MRPC (Paraphrase Detection) | Table 1 MRPC row | Table 1 |
What To Try In 7 Days
Build a prompt pool from your labeled training examples, sampling at most 10k examples per task so that no single task dominates the pool.
Score a subset (L=50) of prompts with a small frozen LLM to label positives/negatives and fine-tune a bi-encoder retriever (BERT-base init).
At inference, prepend top‑K=3 retrieved demonstrations to test inputs and compare against your 0‑shot baseline; watch for harm on coreference/commonsense.
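The first two steps above can be sketched as follows. This is a rough outline under stated assumptions, not the paper's code: `score_fn` is a hypothetical stand-in for the frozen LLM's scoring of a prompt on a training input (here replaced by a toy word-overlap function), and taking the top and bottom `n_each` scored prompts as positives and negatives is an assumed labeling rule.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_pool(examples_by_task: Dict[str, List[str]], cap: int = 10_000, seed: int = 0) -> List[str]:
    """Merge per-task examples into one pool, sampling at most `cap` per task."""
    rng = random.Random(seed)
    pool: List[str] = []
    for task in sorted(examples_by_task):
        examples = examples_by_task[task]
        pool.extend(examples if len(examples) <= cap else rng.sample(examples, cap))
    return pool


def label_prompts(
    train_input: str,
    pool: List[str],
    score_fn: Callable[[str, str], float],
    subset_size: int = 50,  # L=50 from the text
    n_each: int = 1,        # assumed cut-off; use subsets with at least 2 * n_each prompts
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Score a random subset of the pool and return (positives, negatives)."""
    rng = random.Random(seed)
    subset = rng.sample(pool, min(subset_size, len(pool)))
    ranked = sorted(subset, key=lambda p: score_fn(p, train_input), reverse=True)
    return ranked[:n_each], ranked[-n_each:]


def toy_score(prompt: str, train_input: str) -> float:
    """Toy stand-in for the frozen LLM: counts shared words. Not the paper's metric."""
    return float(len(set(prompt.split()) & set(train_input.split())))


pool = build_pool({
    "sentiment": ["is the movie good yes", "the movie ended"],
    "qa": ["is the book long no", "capital paris"],
})
positives, negatives = label_prompts("is the movie good", pool, toy_score)
```

The resulting positive and negative prompts per training input are the supervision for fine-tuning the bi-encoder; the fine-tuning loop itself is omitted here.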
Risks & Boundaries
Limitations
Performance drops on tasks formulated as pure language modeling (Commonsense Reasoning) and on Coreference Resolution.
Scoring prompts requires many model calls; authors reduce cost by sampling subsets but tuning still incurs nontrivial compute.
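A back-of-envelope calculation shows why subset sampling matters. The figures below are purely illustrative assumptions (1,000 training examples and a 10,000-prompt pool are not from the paper), and the estimate assumes one frozen-LLM call per (training example, prompt) pair.

```python
def scoring_calls(num_train_examples: int, prompts_scored_per_example: int) -> int:
    """Number of frozen-LLM calls, assuming one call per (example, prompt) pair."""
    return num_train_examples * prompts_scored_per_example


# Hypothetical sizes, chosen only to illustrate the scaling.
full = scoring_calls(1_000, 10_000)  # score every pool prompt for every example
sampled = scoring_calls(1_000, 50)   # score only a sampled subset of L=50 prompts
reduction = full // sampled          # how many times fewer calls sampling needs
```

Even with sampling, the call count grows linearly with the number of training examples, which is why tuning still carries nontrivial compute.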
When Not To Use
For coreference or tasks that consistently underperform with demonstrations.
When you cannot assemble a diverse prompt pool of task demonstrations.
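One way to act on these boundaries is to gate retrieval per task using your own evaluation numbers. The sketch below is an assumed workflow, not part of UPRISE: the function name `tasks_to_enable` and the `min_gain` threshold are hypothetical, the two positive rows reuse the figures reported above, and the coreference row is a made-up placeholder.

```python
from typing import Dict, List


def tasks_to_enable(
    baseline: Dict[str, float],
    with_retrieval: Dict[str, float],
    min_gain: float = 0.0,
) -> List[str]:
    """Return tasks where retrieval beats the 0-shot baseline by more than min_gain."""
    return sorted(
        task
        for task, base in baseline.items()
        if with_retrieval.get(task, float("-inf")) - base > min_gain
    )


baseline = {
    "reading_comprehension": 31.6,  # reported 0-shot average
    "mrpc": 46.6,                   # reported 0-shot accuracy
    "coreference_example": 60.0,    # made-up placeholder, not from the paper
}
with_retrieval = {
    "reading_comprehension": 40.1,  # reported UPRISE average
    "mrpc": 67.9,                   # reported UPRISE accuracy
    "coreference_example": 58.0,    # made-up placeholder showing a regression
}
enabled = tasks_to_enable(baseline, with_retrieval)
```

Tasks missing from `enabled` keep the plain 0-shot prompt.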
Failure Modes
Retrieved prompts with mismatched input-output formats can harm performance (e.g., Closed-book QA prompts retrieved for Commonsense tasks).
If the pool contains no good positive prompts for a training example, retriever training may be noisy, or that example may be filtered out.

