Train a small-model retriever to pick natural-language demonstrations that boost zero-shot LLMs across tasks and models

March 15, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

2

Authors

Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, Qi Zhang

Links

Abstract / PDF

Why It Matters For Business

Train a single lightweight retriever once (with a small model) to boost many larger LLMs at inference, cutting repeated fine-tuning costs and improving zero-shot accuracy on many NLU tasks.

Summary TLDR

UPRISE trains a lightweight bi-encoder retriever (initialized from BERT-base) using a small frozen LLM (GPT‑Neo‑2.7B) as a labeler. The retriever picks natural-language demonstrations from a pool and prepends the top-K (K=3) to the test input. This improves zero-shot performance on many NLU clusters (notably Reading Comprehension and Paraphrase Detection), transfers from the small tuning model to much larger LLMs (BLOOM, OPT, GPT‑3 family), and reduces hallucination on fact-checking for ChatGPT in the experiments.

Problem Statement

Designing prompts or fine-tuning each new LLM is costly and brittle. Can we train a single, lightweight prompt retriever on diverse tasks (using a small frozen LLM to score prompts) that generalizes to unseen task types and to much larger LLMs at inference time?

Main Contribution

Introduce UPRISE: a prompt-retrieval pipeline that trains a bi-encoder retriever with a frozen LLM as a labeler.

Demonstrate cross-task and cross-model generalization: a retriever tuned with GPT‑Neo‑2.7B transfers to BLOOM-7.1B, OPT-66B, and GPT-3 (Davinci) without extra tuning.

Show UPRISE can reduce hallucination on fact-checking for ChatGPT in human-evaluated samples.

Key Findings

UPRISE raises the zero-shot Reading Comprehension average on GPT‑Neo‑2.7B from 31.6 to 40.1.

Numbers: 31.6 -> 40.1 (+8.5 pp)

Paraphrase detection sees large gains (example: MRPC accuracy 46.6 -> 67.9 with GPT‑Neo‑2.7B).

Numbers: 46.6 -> 67.9 (+21.3 pp)

A retriever tuned with GPT‑Neo‑2.7B transfers to larger LLMs and shows consistent average gains (Davinci avg 45.9 -> 53.6).

Numbers: 45.9 -> 53.6 (+7.7 pp)

UPRISE reduces hallucination on fact-checking with ChatGPT: FEVER2.0 51 -> 56, Covid-19 47 -> 83 (human-evaluated samples).

Numbers: FEVER2.0 51 -> 56; Covid-19 47 -> 83

UPRISE harms or does not help some tasks: Commonsense Reasoning average fell (63.9 -> 62.2) and Coreference Resolution declined (64.0 -> 62.1).

Numbers: Commonsense 63.9 -> 62.2; Coref 64.0 -> 62.1
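The gains and losses quoted above are simple differences in percentage points. The snippet below is a minimal sketch that recomputes them from the baseline and UPRISE values listed in this section; the dictionary layout and the `gain_pp` helper are illustrative names, not part of the paper.

```python
# Baseline (0-shot) and UPRISE values as quoted in the Key Findings above.
FINDINGS = {
    "Reading Comprehension avg (GPT-Neo-2.7B)": (31.6, 40.1),
    "MRPC accuracy (GPT-Neo-2.7B)": (46.6, 67.9),
    "Davinci avg": (45.9, 53.6),
    "FEVER2.0 (ChatGPT)": (51.0, 56.0),
    "Covid-19 (ChatGPT)": (47.0, 83.0),
    "Commonsense Reasoning avg": (63.9, 62.2),
    "Coreference Resolution avg": (64.0, 62.1),
}


def gain_pp(baseline, value):
    """Absolute change in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


if __name__ == "__main__":
    for name, (baseline, value) in FINDINGS.items():
        print(f"{name}: {baseline} -> {value} ({gain_pp(baseline, value):+.1f} pp)")
```

Negative values mark the clusters where prepending retrieved demonstrations hurt rather than helped.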

Results

Reading Comprehension average (GPT-Neo-2.7B)

Value: 40.1

Baseline: 31.6 (0-SHOT)

MRPC accuracy (GPT-Neo-2.7B)

Value: 67.9

Baseline: 46.6 (0-SHOT)

Davinci average (selected clusters)

Value: 53.6

Baseline: 45.9 (0-SHOT)

FEVER2.0 (ChatGPT human-eval)

Value: 56

Baseline: 51 (0-SHOT)

Covid-19 fact-check (ChatGPT human-eval)

Value: 83

Baseline: 47 (0-SHOT)

Who Should Care

What To Try In 7 Days

Build a prompt pool from your labeled training examples, sampling at most 10k per task so that no single task dominates the pool.

Score a random subset (L=50) of prompts with a small frozen LLM to label positives and negatives, then fine-tune a bi-encoder retriever (BERT-base init) on those labels.

At inference, prepend top‑K=3 retrieved demonstrations to test inputs and compare against your 0‑shot baseline; watch for harm on coreference/commonsense.
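The first and last steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `build_prompt_pool`, `assemble_input`, the `(task, demonstration)` pair format, and the blank-line separator are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_prompt_pool(examples, cap_per_task=10_000, seed=0):
    """Build a pool from (task_name, demonstration_text) pairs.

    Tasks with more than `cap_per_task` demonstrations are randomly
    down-sampled so that no single task dominates the pool.
    """
    by_task = defaultdict(list)
    for task, demo in examples:
        by_task[task].append(demo)

    rng = random.Random(seed)
    pool = []
    for task in sorted(by_task):  # sorted for reproducible ordering
        demos = by_task[task]
        if len(demos) > cap_per_task:
            demos = rng.sample(demos, cap_per_task)
        pool.extend((task, d) for d in demos)
    return pool


def assemble_input(retrieved_demos, test_input, separator="\n\n"):
    """Prepend the retrieved demonstrations (e.g. top-K=3) to the test input.

    The separator is an assumption; use whatever format your LLM expects.
    """
    return separator.join(list(retrieved_demos) + [test_input])
```

To compare against a 0-shot baseline, run the same evaluation twice: once on the raw test input and once on the output of `assemble_input`.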

Optimization Features

Infra Optimization

  • encode the prompt pool once and use maximum inner product search (MIPS) at inference
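As a rough illustration of the MIPS step, the sketch below does exact brute-force search over pre-computed prompt embeddings. It assumes the embeddings are plain lists of floats produced elsewhere by the retriever; `inner_product` and `top_k_mips` are hypothetical helper names, and a real deployment would likely use a dedicated vector index instead.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_mips(query_vec, pool_vecs, k=3):
    """Exact maximum inner product search by brute force.

    Returns the indices of the k pool vectors with the largest inner
    product against the query, best first.
    """
    scored = ((inner_product(query_vec, p), i) for i, p in enumerate(pool_vecs))
    return [i for _, i in heapq.nlargest(k, scored)]
```

Because the pool embeddings do not depend on the test input, they can be computed once offline; only the query needs to be encoded at inference time.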

Training Optimization

  • score only a random subset (L=50) of prompts to save compute
  • repeat sampling for up to 7 rounds to find positives
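The two bullets above describe a sample-and-retry labeling loop. A minimal sketch follows, under several assumptions: `score_fn(prompt, x)` stands in for a call to the frozen LLM, `is_positive` is a hypothetical rule for deciding which scores count as positives (this summary does not state the exact criterion), and examples with no positive after the last round are dropped.

```python
import random


def label_example(x, pool, score_fn, is_positive,
                  subset_size=50, max_rounds=7, seed=0):
    """Label prompts for one training input x.

    Each round scores a fresh random subset of the pool instead of the
    whole pool. Sampling repeats until at least one positive is found
    or max_rounds is reached.
    """
    rng = random.Random(seed)
    for round_idx in range(1, max_rounds + 1):
        subset = rng.sample(pool, min(subset_size, len(pool)))
        scored = [(p, score_fn(p, x)) for p in subset]
        positives = [p for p, s in scored if is_positive(s)]
        if positives:
            negatives = [p for p, s in scored if not is_positive(s)]
            return {"positives": positives,
                    "negatives": negatives,
                    "rounds": round_idx}
    return None  # no positive found; the caller may filter this example out
```

In the worst case this costs `subset_size * max_rounds` model calls per training input (350 with the defaults), which is the bound to budget for.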

Inference Optimization

  • use K=3 prompts to balance cost and gain

Reproducibility

Data Urls

  • public datasets listed in Appendix A (SQuADv1, BoolQ, MRPC, etc.)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance drops on tasks formulated as pure language modeling (Commonsense Reasoning) and on Coreference Resolution.
  • Scoring prompts requires many model calls; authors reduce cost by sampling subsets but tuning still incurs nontrivial compute.
  • Experiments are language-only; cross-modal or API/tool prompts untested.

When Not To Use

  • For coreference or tasks that consistently underperform with demonstrations.
  • When you cannot assemble a diverse prompt pool of task demonstrations.
  • If scoring many prompt+input pairs is prohibitively expensive for your budget.

Failure Modes

  • Retrieved prompts with mismatched input-output formats can harm performance (e.g., Closed-book QA prompts retrieved for Commonsense tasks).
  • If no good positive prompts exist in the pool, retriever training may be noisy, or the affected examples may be filtered out.
  • Fine-tuned retriever may overfit to demonstration styles seen in training clusters and not match certain held-out formats.

Core Entities

Models

  • GPT-Neo-2.7B
  • BLOOM-7.1B
  • OPT-66B
  • GPT3-175B (Davinci / text-davinci-001)
  • ChatGPT (gpt-3.5-turbo-0301)

Metrics

  • Accuracy
  • F1
  • Exact Match (EM)
  • Per-token likelihood (for MC scoring)
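For multiple-choice tasks, per-token likelihood scoring picks the option whose tokens the LLM finds most likely on average, which avoids penalizing longer options. The sketch below assumes the per-token log-probabilities for each option have already been obtained from the LLM; `per_token_log_likelihood` and `pick_choice` are illustrative names.

```python
def per_token_log_likelihood(token_logprobs):
    """Average log-probability per token (length-normalized score)."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_choice(choice_logprobs):
    """Return the option with the highest per-token log-likelihood.

    choice_logprobs maps each option string to the list of log-probs
    the LLM assigned to that option's tokens.
    """
    return max(choice_logprobs,
               key=lambda c: per_token_log_likelihood(choice_logprobs[c]))


if __name__ == "__main__":
    scores = {
        "short": [-1.2],                    # sum -1.2, mean -1.2
        "long answer": [-0.5, -0.5, -0.5],  # sum -1.5, mean -0.5
    }
    print(pick_choice(scores))
```

In this example a plain sum of log-probs would favor "short", while the per-token average favors "long answer", which is the behavior the normalization is meant to provide.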

Datasets

  • SQuADv1
  • BoolQ
  • MultiRC
  • OBQA
  • ARC-c/e
  • Natural Questions
  • MRPC
  • QQP
  • PAWS
  • MNLI
  • QNLI
  • SNLI
  • RTE
  • SST-2
  • Yelp
  • Sentiment140
  • TruthfulQA
  • FEVER2.0
  • Covid-19 fact-check subset

Benchmarks

  • Reading Comprehension cluster
  • Closed-book QA cluster
  • Paraphrase Detection cluster
  • Natural Language Inference cluster
  • Sentiment Analysis cluster
  • Commonsense Reasoning cluster
  • Coreference Resolution cluster