Train a small-model retriever to pick natural-language demonstrations that boost zero-shot LLMs across tasks and models

March 15, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Training a retriever with a small frozen LLM is a practical, lower-cost way to improve the zero-shot performance of many larger LLMs; validate per task, because performance on some tasks (coreference, certain commonsense reasoning tasks) may get worse.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, Qi Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Train a single lightweight retriever once (with a small model) to boost many larger LLMs at inference, cutting repeated fine-tuning costs and improving zero-shot accuracy on many NLU tasks.

Who Should Care

Summary TLDR

UPRISE trains a lightweight bi-encoder retriever (initialized from BERT-base) using a small frozen LLM (GPT‑Neo‑2.7B) as a labeler. The retriever picks natural-language demonstrations from a pool and prepends the top-K (K=3) to the test input. This improves zero-shot performance on many NLU clusters (notably Reading Comprehension and Paraphrase Detection), transfers from the small tuning model to much larger LLMs (BLOOM, OPT, GPT‑3 family), and reduces hallucination on fact-checking for ChatGPT in the experiments.
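The retrieve-and-prepend step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the two-dimensional vectors stand in for bi-encoder embeddings, the pool texts are invented, and the function names (`retrieve_top_k`, `build_prompt`) and the blank-line separator between demonstrations are assumptions.

```python
import heapq

def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, pool_vecs, k=3):
    """Indices of the k pool vectors with the largest inner product, best first."""
    scored = ((inner_product(query_vec, vec), idx) for idx, vec in enumerate(pool_vecs))
    return [idx for _, idx in heapq.nlargest(k, scored)]

def build_prompt(test_input, demonstrations):
    """Prepend the retrieved demonstrations to the test input."""
    return "\n\n".join(list(demonstrations) + [test_input])

# Toy pool: texts and embeddings are made up for illustration only.
pool_texts = [
    "Question: Is the sky blue? Answer: yes",
    "Review: A great film. Sentiment: positive",
    "Question: Is fire cold? Answer: no",
    "Premise: It rains. Hypothesis: The ground is wet. Answer: entailment",
]
pool_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

query_vec = [1.0, 0.2]  # hypothetical embedding of the test input
top = retrieve_top_k(query_vec, pool_vecs, k=3)
prompt = build_prompt("Question: Is water wet? Answer:", [pool_texts[i] for i in top])
print(prompt)
```

In a real setup the pool embeddings would come from the trained retriever and be computed once, so only the test input needs encoding at inference time.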

Problem Statement

Designing prompts or fine-tuning each new LLM is costly and brittle. Can we train a single, lightweight prompt retriever on diverse tasks (using a small frozen LLM to score prompts) that generalizes to unseen task types and to much larger LLMs at inference time?

Main Contribution

Introduce UPRISE: a prompt-retrieval pipeline that trains a bi-encoder retriever with a frozen LLM as a labeler.

Demonstrate cross-task and cross-model generalization: a retriever tuned with GPT‑Neo‑2.7B transfers to BLOOM-7.1B, OPT-66B, and GPT‑3 (Davinci) without extra tuning.
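One common way to train a bi-encoder from positive and negative labels is a contrastive (InfoNCE-style) objective. This summary does not state which loss the authors use, so the sketch below is an assumption that only illustrates the general idea: the loss is small when the input embedding scores the positive prompt higher than the negatives.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query_vec, positive_vec, negative_vecs):
    """Negative log-softmax of the positive's similarity against all candidates."""
    pos_score = dot(query_vec, positive_vec)
    logits = [pos_score] + [dot(query_vec, neg) for neg in negative_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - pos_score

# Toy embeddings (illustrative values, not model outputs).
query = [1.0, 0.0]
aligned_loss = contrastive_loss(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = contrastive_loss(query, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < misaligned_loss)
```

Minimizing such a loss pulls the input embedding toward prompts the frozen LLM labeled as helpful and away from the rest.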

Key Findings

UPRISE raises the zero-shot Reading Comprehension average on GPT‑Neo‑2.7B from 31.6 to 40.1, an absolute gain of 8.5 points.

Numbers: 31.6 -> 40.1 (+8.5 pp)

Practical Use: If you have a small LLM or limited compute, training a retriever and prepending 3 retrieved demonstrations can meaningfully boost zero-shot performance on reading-comprehension-style tasks.

Evidence Ref: Table 1 (Reading Comprehension averages)

Paraphrase detection sees large gains (example: MRPC accuracy 46.6 -> 67.9 with GPT‑Neo‑2.7B).

Numbers: 46.6 -> 67.9 (+21.3 pp)

Practical Use: For sentence-pair tasks, retrieving similar-format demonstrations is a cheap way to get large accuracy improvements without fine-tuning the large model.

Evidence Ref: Table 1 (MRPC row)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Reading Comprehension average (GPT-Neo-2.7B) | 40.1 | 31.6 (0-SHOT) | +8.5 pp | Reading Comprehension cluster (avg over SQuADv1, BoolQ, MultiRC, OBQA) | Table 1 shows 0-SHOT 31.6 -> UPRISE 40.1 | Table 1
Accuracy | 67.9 | 46.6 (0-SHOT) | +21.3 pp | MRPC (Paraphrase Detection) | Table 1 MRPC row | Table 1

What To Try In 7 Days

Build a prompt pool from your labeled training examples, sampling at most 10k examples per task so that no single task dominates the pool.

Score a random subset (L=50) of pool prompts with a small frozen LLM, label them as positives or negatives, and fine-tune a bi-encoder retriever (BERT-base init) on those labels.

At inference, prepend top‑K=3 retrieved demonstrations to test inputs and compare against your 0‑shot baseline; watch for harm on coreference/commonsense.
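The pool-building step can be sketched as below. It is a minimal illustration under assumptions: `build_prompt_pool` is a hypothetical helper, examples are plain strings grouped by task name, and uniform random sampling is assumed for tasks that exceed the cap.

```python
import random

def build_prompt_pool(examples_by_task, cap_per_task=10_000, seed=0):
    """Merge per-task examples into one pool, keeping at most cap_per_task per task."""
    rng = random.Random(seed)  # fixed seed so the pool is reproducible
    pool = []
    for task, examples in sorted(examples_by_task.items()):
        if len(examples) > cap_per_task:
            chosen = rng.sample(examples, cap_per_task)
        else:
            chosen = list(examples)
        pool.extend((task, text) for text in chosen)
    return pool

# Toy data with a small cap so the effect is visible.
examples_by_task = {
    "mrpc": [f"mrpc example {i}" for i in range(10)],
    "boolq": ["boolq example 0", "boolq example 1"],
}
pool = build_prompt_pool(examples_by_task, cap_per_task=3)
per_task = {task: sum(1 for t, _ in pool if t == task) for task in examples_by_task}
print(per_task)
```

Keeping the task name alongside each entry makes it easier to check later whether harmful retrievals come from a mismatched task type.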

Optimization Features

Infra Optimization
encode prompt pool once and use MIPS at inference
Training Optimization
score only random subset (L=50) to save compute
repeat sampling up to 7 rounds to find positives
Inference Optimization
use K=3 prompts to balance cost and gain
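The training-side optimization (score only a random subset, resample up to 7 rounds until a positive appears) might look like the sketch below. The `score_fn` callable stands in for a call to the frozen LLM, and the `is_positive` criterion is a hypothetical placeholder; how positives are actually decided is not specified in this summary.

```python
import random

def label_prompts(pool, score_fn, is_positive, subset_size=50, max_rounds=7, seed=0):
    """Sample subsets until one contains a positive; return (positives, negatives) or None."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        subset = rng.sample(pool, min(subset_size, len(pool)))
        scored = [(score_fn(prompt), prompt) for prompt in subset]
        positives = [p for s, p in scored if is_positive(s)]
        if positives:
            negatives = [p for s, p in scored if not is_positive(s)]
            return positives, negatives
    return None  # no positive found within the round budget

# Toy pool: 60 of 100 prompts score 1.0, so any 50-prompt subset contains a positive.
pool = [f"prompt {i}" for i in range(100)]

def toy_score(prompt):
    """Stand-in for the frozen LLM's task score with this prompt prepended."""
    return 1.0 if int(prompt.split()[1]) % 5 < 3 else 0.0

result = label_prompts(pool, toy_score, is_positive=lambda s: s >= 0.5)
positives, negatives = result
print(len(positives), len(negatives))
```

With L=50 and 7 rounds, each training example costs at most 350 scoring calls regardless of pool size, which is where the compute saving comes from.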

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Public datasets listed in Appendix A (SQuADv1, BoolQ, MRPC, etc.)

Risks & Boundaries

Limitations

Performance drops on tasks formulated as pure language modeling (Commonsense Reasoning) and on Coreference Resolution.

Scoring prompts requires many model calls; authors reduce cost by sampling subsets but tuning still incurs nontrivial compute.

When Not To Use

For coreference or tasks that consistently underperform with demonstrations.

When you cannot assemble a diverse prompt pool of task demonstrations.

Failure Modes

Retrieved prompts with mismatched input-output formats can harm performance (e.g., Closed-book QA prompts retrieved for Commonsense tasks).

If no good positive prompts exist in the pool, retriever training may be noisy, or the affected training examples may be filtered out.

Core Entities

Models

GPT-Neo-2.7B, BLOOM-7.1B, OPT-66B, GPT3-175B (Davinci / text-davinci-001), ChatGPT (gpt-3.5-turbo-0301)

Metrics

Accuracy, F1, Exact Match (EM), Per-token likelihood (for MC scoring)
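Per-token likelihood for multiple-choice (MC) scoring can be illustrated with a short sketch: each answer option is scored by the average log-probability of its tokens, and the highest-scoring option is chosen. The log-probabilities below are invented for illustration, and the helper names are assumptions; in practice the values would come from the LLM being evaluated.

```python
def per_token_log_likelihood(token_logprobs):
    """Average log-probability per token, so longer options are not penalized."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_option(option_logprobs):
    """Return the option name with the highest per-token log-likelihood."""
    return max(option_logprobs, key=lambda name: per_token_log_likelihood(option_logprobs[name]))

# Hypothetical token log-probs for two answer options of different lengths.
option_logprobs = {
    "A": [-1.0, -1.0],              # total -2.0, per-token -1.0
    "B": [-0.6, -0.6, -0.6, -0.6],  # total -2.4, per-token -0.6
}
choice = pick_option(option_logprobs)
total_choice = max(option_logprobs, key=lambda name: sum(option_logprobs[name]))
print(choice, total_choice)
```

The example is built so that the two criteria disagree: summed log-likelihood favors the shorter option, while the per-token average favors the option whose tokens are individually more likely.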

Datasets

SQuADv1, BoolQ, MultiRC, OBQA, ARC-c/e, Natural Questions, MRPC, QQP, PAWS, MNLI, QNLI, SNLI, RTE, SST-2, Yelp, Sentiment140, TruthfulQA, FEVER2.0, Covid-19 fact-check subset

Benchmarks

Reading Comprehension cluster, Closed-book QA cluster, Paraphrase Detection cluster, Natural Language Inference cluster, Sentiment Analysis cluster, Commonsense Reasoning cluster, Coreference Resolution cluster