Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Train a single lightweight retriever once (with a small model) to boost many larger LLMs at inference, cutting repeated fine-tuning costs and improving zero-shot accuracy on many NLU tasks.
Summary TLDR
UPRISE trains a lightweight bi-encoder retriever (initialized from BERT-base) using a small frozen LLM (GPT‑Neo‑2.7B) as a labeler. The retriever picks natural-language demonstrations from a pool and prepends the top-K (K=3) to the test input. This improves zero-shot performance on many NLU clusters (notably Reading Comprehension and Paraphrase Detection), transfers from the small tuning model to much larger LLMs (BLOOM, OPT, GPT‑3 family), and reduces hallucination on fact-checking for ChatGPT in the experiments.
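The final step described above (prepend the top-K retrieved demonstrations to the test input) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_prompt`, the blank-line separator, and the highest-score-first ordering are all assumptions, and the scores are placeholders for whatever the retriever returns.

```python
from typing import List, Tuple


def build_prompt(test_input: str,
                 scored_demos: List[Tuple[float, str]],
                 k: int = 3) -> str:
    """Prepend the k highest-scoring demonstrations to the test input.

    scored_demos holds (retriever_score, demonstration_text) pairs.
    Ordering and separator are illustrative assumptions.
    """
    top = sorted(scored_demos, key=lambda pair: pair[0], reverse=True)[:k]
    demos = [text for _, text in top]
    # Demonstrations come first, the test input goes last.
    return "\n\n".join(demos + [test_input])


if __name__ == "__main__":
    demos = [(0.2, "Q: a? A: 1"), (0.9, "Q: b? A: 2"),
             (0.5, "Q: c? A: 3"), (0.7, "Q: d? A: 4")]
    print(build_prompt("Q: e? A:", demos))
```

With K=3, the lowest-scoring demonstration in the example is dropped and the test input always ends the prompt.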
Problem Statement
Designing prompts or fine-tuning each new LLM is costly and brittle. Can we train a single, lightweight prompt retriever on diverse tasks (using a small frozen LLM to score prompts) that generalizes to unseen task types and to much larger LLMs at inference time?
Main Contribution
Introduce UPRISE: a prompt-retrieval pipeline that trains a bi-encoder retriever with a frozen LLM as a labeler.
Demonstrate cross-task and cross-model generalization: a retriever tuned with GPT‑Neo‑2.7B transfers to BLOOM-7.1B, OPT-66B, and GPT-3 (Davinci) without extra tuning.
Show UPRISE can reduce hallucination on fact-checking for ChatGPT in human-evaluated samples.
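This summary does not state the retriever's training objective. A common choice for bi-encoder retrievers is a contrastive (InfoNCE-style) loss over dot-product scores, and the sketch below assumes that choice purely for illustration. The vectors stand in for encoder outputs, and `contrastive_loss` is a hypothetical name.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(input_vec: Sequence[float],
                     positive_vec: Sequence[float],
                     negative_vecs: List[Sequence[float]]) -> float:
    """Negative log-softmax of the positive prompt's score.

    Assumed objective (not confirmed by the summary): the input
    embedding should score its LLM-labeled positive prompt above
    the LLM-labeled negatives.
    """
    logits = [dot(input_vec, positive_vec)]
    logits += [dot(input_vec, n) for n in negative_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    x = [1.0, 0.0]
    print(contrastive_loss(x, [1.0, 0.0], [[0.0, 1.0]]))  # aligned positive
    print(contrastive_loss(x, [0.0, 1.0], [[1.0, 0.0]]))  # misaligned positive
```

The loss is lower when the positive prompt's embedding is closer to the input's than the negatives are, which is the behavior the labeler's positive/negative split is meant to induce.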
Key Findings
UPRISE raises the zero-shot Reading Comprehension average on GPT‑Neo‑2.7B from 31.6 to 40.1 (+8.5 absolute).
Paraphrase detection sees large gains (example: MRPC accuracy 46.6 -> 67.9 with GPT‑Neo‑2.7B).
A retriever tuned with GPT‑Neo‑2.7B transfers to larger LLMs and shows consistent average gains (Davinci avg 45.9 -> 53.6).
UPRISE reduces hallucination on fact-checking with ChatGPT: FEVER2.0 51 -> 56, Covid-19 47 -> 83 (human-evaluated samples).
UPRISE harms or does not help some tasks: Commonsense Reasoning average fell (63.9 -> 62.2) and Coreference Resolution declined (64.0 -> 62.1).
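To make the size and sign of each change easy to compare, the short script below computes absolute deltas from the before/after numbers quoted in the findings above. The dictionary keys are shorthand labels chosen here; no new results are introduced.

```python
# (before, after) pairs copied from the findings above.
reported = {
    "Reading Comprehension avg (GPT-Neo-2.7B)": (31.6, 40.1),
    "MRPC accuracy (GPT-Neo-2.7B)": (46.6, 67.9),
    "Davinci avg": (45.9, 53.6),
    "FEVER2.0 (ChatGPT)": (51, 56),
    "Covid-19 (ChatGPT)": (47, 83),
    "Commonsense Reasoning avg": (63.9, 62.2),
    "Coreference Resolution": (64.0, 62.1),
}

# Absolute change in points; negative values are regressions.
deltas = {name: round(after - before, 1)
          for name, (before, after) in reported.items()}

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.1f}")
```

Five of the seven entries improve, and the two regressions (Commonsense Reasoning and Coreference Resolution) are each under two points.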
Results
Reading Comprehension average (GPT-Neo-2.7B): 31.6 -> 40.1
MRPC accuracy (GPT-Neo-2.7B): 46.6 -> 67.9
Davinci average (selected clusters): 45.9 -> 53.6
FEVER2.0 (ChatGPT human-eval): 51 -> 56
Covid-19 fact-check (ChatGPT human-eval): 47 -> 83
Who Should Care
What To Try In 7 Days
Build a prompt pool from your labeled training examples, sampling at most 10k per task so that no single task dominates the pool.
Score a random subset (L=50) of candidate prompts with a small frozen LLM to label positives and negatives, then fine-tune a bi-encoder retriever (BERT-base init) on those labels.
At inference, prepend the top-K (K=3) retrieved demonstrations to each test input and compare against your 0-shot baseline; watch for regressions on coreference and commonsense tasks.
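The first step above (a prompt pool capped per task) can be sketched as below. The function name `build_prompt_pool`, the dictionary input format, and the fixed seed are assumptions made for illustration; only the 10k cap comes from the text.

```python
import random
from typing import Dict, List


def build_prompt_pool(examples_by_task: Dict[str, List[str]],
                      cap: int = 10_000,
                      seed: int = 0) -> List[str]:
    """Merge per-task demonstrations, keeping at most `cap` per task."""
    rng = random.Random(seed)  # fixed seed so the pool is reproducible
    pool: List[str] = []
    for task in sorted(examples_by_task):
        examples = examples_by_task[task]
        if len(examples) > cap:
            # Downsample large tasks so they cannot dominate the pool.
            examples = rng.sample(examples, cap)
        pool.extend(examples)
    return pool


if __name__ == "__main__":
    data = {
        "big_task": [f"big-{i}" for i in range(25_000)],
        "small_task": [f"small-{i}" for i in range(300)],
    }
    print(len(build_prompt_pool(data)))  # 10_000 + 300 demonstrations
```

Tasks smaller than the cap are kept whole, so the cap only limits the largest datasets.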
Optimization Features
Infra Optimization
- encode the prompt pool once and use maximum inner product search (MIPS) at inference
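As a rough illustration of the idea above, the sketch below runs an exact, brute-force inner-product search over precomputed pool embeddings. The embeddings are toy values, `mips_top_k` is a hypothetical helper, and a real deployment would likely swap in an approximate-search library; none of these details come from the source.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(query_vec: Sequence[float],
               pool_vecs: List[Sequence[float]],
               k: int = 3) -> List[Tuple[int, float]]:
    """Return (pool_index, score) for the k largest inner products.

    pool_vecs is assumed to be computed once, offline; only the
    query is encoded at inference time.
    """
    scored = ((i, dot(query_vec, v)) for i, v in enumerate(pool_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(mips_top_k([2.0, 1.0], pool, k=2))
```

Because the pool embeddings never change after tuning, the per-query cost is one encoder pass plus the search.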
Training Optimization
- score only a random subset (L=50) of prompts to save compute
- repeat sampling for up to 7 rounds to find positives
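The two points above can be sketched as a sampling loop. Here `score_fn` is a hypothetical stand-in for scoring one prompt (paired with a training input) using the frozen LLM, and treating a prompt as positive when it beats a `baseline` score is an assumption made for this example. Only L=50 and the 7-round limit come from the text.

```python
import random
from typing import Callable, List, Optional, Tuple


def label_prompts(pool: List[str],
                  score_fn: Callable[[str], float],
                  baseline: float,
                  subset_size: int = 50,
                  max_rounds: int = 7,
                  seed: int = 0) -> Optional[Tuple[List[str], List[str]]]:
    """Sample and score prompt subsets until a positive is found.

    Returns (positives, negatives) from the first round that yields a
    positive, or None if every round fails (the caller may then drop
    this training input).
    """
    rng = random.Random(seed)
    for _ in range(max_rounds):
        subset = rng.sample(pool, min(subset_size, len(pool)))
        scored = [(score_fn(prompt), prompt) for prompt in subset]
        # Assumed criterion: positive means "scores above the baseline".
        positives = [p for s, p in scored if s > baseline]
        if positives:
            negatives = [p for s, p in scored if s <= baseline]
            return positives, negatives
    return None


if __name__ == "__main__":
    pool = [f"demo-{i}" for i in range(1000)]
    toy_score = lambda prompt: 1.0 if prompt.endswith("7") else 0.0
    result = label_prompts(pool, toy_score, baseline=0.5)
    print(None if result is None else (len(result[0]), len(result[1])))
```

The worst case costs `subset_size * max_rounds` scoring calls per training input (350 with the defaults), instead of one call per prompt in the pool.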
Inference Optimization
- use K=3 prompts to balance cost and gain
Reproducibility
Code Urls
Data Urls
- public datasets listed in Appendix A (SQuADv1, BoolQ, MRPC, etc.)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance drops on tasks formulated as pure language modeling (Commonsense Reasoning) and on Coreference Resolution.
- Scoring prompts requires many model calls; the authors reduce cost by sampling subsets, but tuning still incurs nontrivial compute.
- Experiments are language-only; cross-modal and API/tool prompts are untested.
When Not To Use
- For coreference resolution or other tasks that consistently underperform with demonstrations.
- When you cannot assemble a diverse prompt pool of task demonstrations.
- If scoring many prompt+input pairs is prohibitively expensive for your budget.
Failure Modes
- Retrieved prompts with mismatched input-output formats can harm performance (e.g., Closed-book QA prompts retrieved for Commonsense tasks).
- If no good positive prompts exist in the pool, retriever training may be noisy, or training examples may be filtered out for lack of a positive.
- The fine-tuned retriever may overfit to demonstration styles seen in the training clusters and fail to match certain held-out formats.
Core Entities
Models
- GPT-Neo-2.7B
- BLOOM-7.1B
- OPT-66B
- GPT3-175B (Davinci / text-davinci-001)
- ChatGPT (gpt-3.5-turbo-0301)
Metrics
- Accuracy
- F1
- Exact Match (EM)
- Per-token likelihood (for MC scoring)
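For the last metric, one plausible reading is length-normalized log-likelihood: each multiple-choice option is scored by the mean log-probability of its tokens and the highest-scoring option is chosen. The sketch below assumes that reading. The log-probabilities are made-up numbers, not model outputs, and the function names are hypothetical.

```python
from typing import List, Sequence


def per_token_loglik(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability per token of one answer option."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_choice(choice_logprobs: List[Sequence[float]]) -> int:
    """Index of the option with the highest per-token log-likelihood."""
    scores = [per_token_loglik(lp) for lp in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    short_option = [-0.5, -0.5]             # total -1.0, mean -0.5
    long_option = [-0.3, -0.3, -0.3, -0.3]  # total -1.2, mean -0.3
    print(pick_choice([short_option, long_option]))
```

In the example the longer option has the lower total log-likelihood but the higher per-token value, so normalization changes which option is picked. This is why per-token scoring is often preferred when options differ in length.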
Datasets
- SQuADv1
- BoolQ
- MultiRC
- OBQA
- ARC-c/e
- Natural Questions
- MRPC
- QQP
- PAWS
- MNLI
- QNLI
- SNLI
- RTE
- SST-2
- Yelp
- Sentiment140
- TruthfulQA
- FEVER2.0
- Covid-19 fact-check subset
Benchmarks
- Reading Comprehension cluster
- Closed-book QA cluster
- Paraphrase Detection cluster
- Natural Language Inference cluster
- Sentiment Analysis cluster
- Commonsense Reasoning cluster
- Coreference Resolution cluster

