ISARA: iteratively self-align an LLM using retrieval-augmented in-context learning and <100 seed examples

January 6, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu

Why It Matters For Business

You can improve model safety and truthfulness in new domains with very small labeled seeds and no extra human rules or reward models, cutting annotation cost and speeding deployment.

Summary TLDR

The paper introduces ISARA, a practical recipe to align LLMs to new domains using only a small seed set (e.g., 64 samples). ISARA alternates retrieval-augmented in-context generation and supervised fine-tuning. It needs no hand-crafted instructions or external reward models. Across safety, truthfulness and instruction-following tests, ISARA expands the dataset 4–11×, improves harmlessness and truthfulness versus simple SFT and ICL baselines, and works on models down to ~350M parameters. Results rely on automatic classifiers and automatic evaluators.

Problem Statement

How can we align LLMs to a new target domain when only a handful of high-quality examples exist and we want to avoid hand-written instructions or building reward models?

Main Contribution

ISARA: an iterative pipeline that generates new labeled QA pairs via retrieval-augmented in-context learning (ICL) and then SFTs the model on those samples.

A human-instruction-free method: prompts use only example QA pairs, not handcrafted rules or principles.

Shows data-scaling and alignment gains with small seed sets (<100) on safety, truthfulness, and instruction-following benchmarks.

Demonstrates applicability to smaller models (down to OPT-350M) and reports that iterative training outperforms a single large-generation round.
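The alternation between generation and fine-tuning can be sketched as a small control loop. This is a minimal sketch, not the authors' implementation: `isara_loop`, `generate`, and `fine_tune` are hypothetical names, and the choices to retrieve from the seed plus all earlier generations and to fine-tune on the seed plus only the newest batch are assumptions this summary does not confirm.

```python
from typing import Any, Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def isara_loop(
    model: Any,
    seed: List[QA],
    questions: List[str],
    generate: Callable[[Any, List[QA], List[str]], List[QA]],
    fine_tune: Callable[[Any, List[QA], List[QA]], Any],
    num_iters: int = 2,
    per_iter: int = 512,
) -> Tuple[Any, List[QA]]:
    """Alternate self-generation and SFT; return the final model and all generated pairs."""
    generated: List[QA] = []
    for k in range(num_iters):
        # Take the next slice of unlabeled questions for this iteration.
        batch = questions[k * per_iter:(k + 1) * per_iter]
        if not batch:
            break
        # Assumption: demonstrations are retrieved from the seed plus earlier generations.
        pool = seed + generated
        new_pairs = generate(model, pool, batch)
        # Assumption: each SFT round sees the seed and the newest generated batch.
        model = fine_tune(model, seed, new_pairs)
        generated.extend(new_pairs)
    return model, generated


# Stub demo: the "model" is just a version counter so the loop runs without a real LLM.
def fake_generate(model, pool, batch):
    return [(q, f"answer from model v{model}") for q in batch]


def fake_fine_tune(model, seed, new_pairs):
    return model + 1


final_model, data = isara_loop(
    0, [("q0", "a0")], ["q1", "q2", "q3", "q4"],
    fake_generate, fake_fine_tune, num_iters=2, per_iter=2,
)
```

Passing the generation and fine-tuning steps as callables keeps the loop independent of any particular training stack; in practice they would wrap retrieval-augmented ICL and an SFT run.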

Key Findings

ISARA can sharply reduce harmful outputs on safety prompts.

Numbers: LLaMA-7B harmful rate (discrimination category): 37.6% → 1.2% (pretrained → ISARA)

Iterative fine-tuning beats one-shot data generation when total new samples are equal.

Numbers: LLaMA-7B harmful rate: 12.8% (one-shot, 1024 samples) → 5.6% (two iterations × 512 samples)

ISARA expands small seed sets by several times via self-generation.

Numbers: data-scaling ratio on safety: mean ×6.7 (LLaMA-7B) and ×6.3 (OPT-6.7B)

ISARA scores higher than SFT on TruthfulQA truthfulness for LLaMA-7B.

Numbers: ROUGE-L diff: -6.15 (SFT) → +3.82 (ISARA), LLaMA-7B

Results

harmful rate (safety)

Value: 1.2% (LLaMA-7B, discrimination; ISARA)

Baseline: 37.6% (LLaMA-7B pretrained)

harmful rate (safety, averaged categories)

Value: 9.2% → 5.6% (LLaMA-7B: one iteration → two iterations)

Baseline: 12.8% (one-shot N=1024 variant)

truthfulness (ROUGE-L diff)

Value: +3.82 (LLaMA-7B, ISARA)

Baseline: -6.15 (LLaMA-7B, SFT)

data-scaling ratio

Value: ×6.7 (mean for LLaMA-7B on safety)

Baseline: ×1 (seed only)

instruction-following (winning rate)

Value: ISARA beats SFT and ICL baselines (LLaMA-2-7B)

Baseline: SFT and ICL methods

What To Try In 7 Days

Pick a 50–100 example seed for a target domain (safety/truthfulness/helpfulness).

Implement retrieval-augmented ICL using kNN + sentence embeddings to produce new QA pairs.

Run 1–2 ISARA iterations (generate ~512 samples per iteration), fine-tune, then compare harmful rate (using a safety classifier) and utility (using a reward model, if available) against the pretrained and SFT baselines.
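The retrieval step in the second item can be prototyped without any model dependencies. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and `retrieve_knn`, `build_prompt`, and the "Question/Answer" prompt template are hypothetical choices, not the paper's. It picks the k seed pairs whose questions are closest to the new question and formats them as in-context demonstrations with no instructions attached.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return dict(Counter(re.findall(r"[a-z0-9']+", text.lower())))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_knn(query: str, pool: List[QA], k: int = 2) -> List[QA]:
    """Return the k pairs whose questions are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda qa: cosine(q, embed(qa[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, demos: List[QA]) -> str:
    """Format retrieved pairs as demonstrations, ending with the open question."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)


pool = [
    ("How can I reset a forgotten password?", "Use the account recovery page."),
    ("What is the capital of France?", "Paris."),
    ("Where can I update billing details?", "Open the billing settings."),
]
query = "How do I reset my password?"
demos = retrieve_knn(query, pool, k=1)
prompt = build_prompt(query, demos)
```

Swapping `embed` for a real sentence encoder and `sorted` for an approximate nearest-neighbor index would be the natural next step once the seed pool grows.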

Agent Features

Tool Use

  • kNN retrieval
  • sentence embeddings
  • beam search decoding

Frameworks

  • ISARA (Iterative Self-Alignment with Retrieval-Augmented ICL)

Optimization Features

Infra Optimization

  • single A100-80G used for experiments

Training Optimization

  • iterative fine-tuning
  • mixing seed D0 with generated Dk using weight γ
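This summary does not say how γ enters the training objective. As one plausible reading, the sketch below assumes a convex combination of the mean loss on the seed set D0 and the mean loss on the generated set Dk; `mixed_loss` is a hypothetical helper and the formula is an assumption, not the paper's stated objective.

```python
from typing import Sequence


def mixed_loss(seed_losses: Sequence[float], gen_losses: Sequence[float], gamma: float) -> float:
    """Assumed objective: gamma * mean loss on D0 + (1 - gamma) * mean loss on Dk."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    # An empty set contributes zero loss rather than raising.
    seed_term = sum(seed_losses) / len(seed_losses) if seed_losses else 0.0
    gen_term = sum(gen_losses) / len(gen_losses) if gen_losses else 0.0
    return gamma * seed_term + (1.0 - gamma) * gen_term
```

Under this reading, a larger γ keeps the model closer to the trusted seed examples, while a smaller γ lets the self-generated data dominate.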

Inference Optimization

  • retrieval-augmented in-context labeling (ICL-kNN)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on a quality seed D0; poor seeds limit gains.
  • Evaluations use automatic classifiers and automatic judges, which can be biased.
  • Filtering heuristics (ROUGE-L threshold, length) may discard valid diversity or keep subtle low-quality samples.
  • Method depends on pretrained model capabilities; very tiny models show limited gains.
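To make the filtering limitation concrete, here is a minimal sketch of a ROUGE-L plus length filter. It assumes the filter drops answers that are too short and answers that are near-duplicates of already-kept answers; the threshold values (`max_rouge=0.7`, `min_words=3`), the whitespace tokenization, and the function names are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer)


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (lowercased)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_generated(
    candidates: List[QA], kept: List[QA], max_rouge: float = 0.7, min_words: int = 3
) -> List[QA]:
    """Keep candidates that are long enough and not near-duplicates of kept answers."""
    accepted: List[QA] = []
    for q, a in candidates:
        if len(a.split()) < min_words:
            continue  # too short to be a useful answer
        if any(rouge_l(a, old) > max_rouge for _, old in kept + accepted):
            continue  # near-duplicate of an answer already in the dataset
        accepted.append((q, a))
    return accepted
```

A filter like this shows both sides of the limitation: a strict `max_rouge` removes legitimately similar refusals, while a loose one lets repetitive answers through.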

When Not To Use

  • When you require certified human-reviewed alignment for high-stakes applications.
  • If you have zero seed examples or no in-domain data to seed retrieval.
  • When you must avoid any automated data generation due to regulatory constraints.

Failure Modes

  • Model generates repeated or low-quality answers and amplifies bias present in seed data.
  • OOD retrieval returns irrelevant contexts, producing noisy annotations.
  • Iterative loop can overfit to artifacts of generated data and reduce generality.

Core Entities

Models

  • LLaMA-7B
  • OPT-6.7B
  • OPT-2.7B
  • OPT-1.3B
  • OPT-350M
  • LLaMA-2-7B

Metrics

  • harmful rate (classification)
  • ROUGE-L difference (truthfulness)
  • winning rate (AlpacaEval judge)
  • data-scaling ratio
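Two of these metrics are simple ratios, sketched below under stated assumptions: harmful rate is taken as the fraction of responses a safety classifier flags as harmful, and the data-scaling ratio is taken as final dataset size divided by seed size (consistent with the "×1 (seed only)" baseline). The function names are hypothetical.

```python
from typing import Sequence


def harmful_rate(flags: Sequence[bool]) -> float:
    """Fraction of responses flagged harmful by a classifier (True = harmful)."""
    if not flags:
        raise ValueError("need at least one classified response")
    return sum(flags) / len(flags)


def data_scaling_ratio(seed_size: int, final_size: int) -> float:
    """Assumed definition: size of the final dataset relative to the seed set."""
    if seed_size <= 0:
        raise ValueError("seed_size must be positive")
    return final_size / seed_size
```

For example, a 64-sample seed that grows to 448 samples gives a ratio of 7, within the 4–11× range reported above.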

Datasets

  • BeaverTails
  • TruthfulQA
  • AlpacaEval
  • Beaver-Dam-7B (classifier)
  • Beaver-7B-v1.0-Reward (reward model)

Benchmarks

  • BeaverTails (safety)
  • TruthfulQA (truthfulness)
  • AlpacaEval (instruction-following)