Train a search-based LLM agent to self-improve via iterative synthetic trajectories and distill it into much smaller models.

December 15, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar


Why It Matters For Business

You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.

Summary TLDR

The paper builds a ReAct-style search agent that reasons and calls a web-search tool, then applies a ReST-like iterative self-training loop using LLM-based ranking and AI feedback (no human labels) to grow and refine synthetic multi-step trajectories. After two iterations, small fine-tuned models (PaLM 2-XS/S) recover much of the teacher's performance on compositional QA benchmarks (Bamboogle/BamTwoogle). LLM auto-eval strongly matches human judgments (Pearson 0.98), letting the authors cheaply run many stochastic agent rollouts for evaluation and selection.

Problem Statement

Agent workflows that interleave reasoning and tool calls are hard to improve with standard end-to-end training because interactions with external tools are non-differentiable and human-labeled multi-step trajectories are expensive and scarce. The paper asks: can an agent bootstrap its own training data and improve via AI feedback alone?

Main Contribution

A ReAct-style search agent that formats prompts as code, preserves trajectory state, and uses self-checks (relevance and grounding).

An adaptation of Reinforced Self-Training (ReST) for agentic multi-step setups: generate trajectories, re-rank with an instruction-tuned LLM, fine-tune, and repeat.

Empirical evidence that two iterations of self-improvement plus distillation produce small models that approach large-model performance on compositional QA benchmarks.

Demonstration that LLM-based auto-eval aligns tightly with human judgments, enabling cheap, low-variance evaluation of stochastic agent rollouts.
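The generate, re-rank, fine-tune, repeat cycle described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `policy`, `ranker`, and `fine_tune` are hypothetical callables standing in for the agent model, the instruction-tuned ranking LLM, and the training job, and the sketch ranks whole trajectories for brevity rather than individual steps.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[str]                                   # one string per agent step
Policy = Callable[[str, random.Random], Trajectory]      # hypothetical agent interface
Ranker = Callable[[str, Trajectory], float]              # hypothetical LLM ranker interface
Dataset = List[Tuple[str, Trajectory]]


def grow(policy: Policy, questions: List[str], n_samples: int,
         rng: random.Random) -> List[Tuple[str, List[Trajectory]]]:
    """Grow step: sample several candidate trajectories per question."""
    return [(q, [policy(q, rng) for _ in range(n_samples)]) for q in questions]


def improve(candidates: List[Tuple[str, List[Trajectory]]], ranker: Ranker) -> Dataset:
    """Improve step: keep the highest-scoring trajectory for each question."""
    return [(q, max(trajs, key=lambda t: ranker(q, t))) for q, trajs in candidates]


def self_improve(policy: Policy, questions: List[str], ranker: Ranker,
                 fine_tune: Callable[[Policy, Dataset], Policy],
                 iterations: int = 2, n_samples: int = 4, seed: int = 0):
    """Run the loop; returns the final policy and the last training set."""
    rng = random.Random(seed)
    dataset: Dataset = []
    for _ in range(iterations):
        dataset = improve(grow(policy, questions, n_samples, rng), ranker)
        policy = fine_tune(policy, dataset)   # assumption: returns the updated policy
    return policy, dataset
```

In a real setup the same `fine_tune` hook could also target a smaller student model, which is where the distillation step would plug in.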

Key Findings

Self-improvement raises small-model auto-eval accuracy substantially.

Numbers: PaLM 2-XS: 44.7±3.1% -> 65.9±2.6% (pilot to 2nd gen)

Distilled small models can approach large-model quality on these benchmarks.

Numbers: Human eval: pre-trained PaLM 2-L 68.8% vs 2nd-gen XS 67.2% (Bamboogle)

LLM-based auto-eval matches human judgments closely.

Numbers: Pearson r=0.98 (p=6.6e-8), Spearman r=0.83 (p=0.0015)

Data quality matters more than raw size.

Numbers: 1st gen 54.4% -> 2nd gen (1x) 63.4% (≈+9 percentage points) at similar training size

Self-critique provides a small positive boost.

Numbers: Typical gain ≈0.5–1.0% on evaluated models
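The ± figures above read naturally as a mean and a spread over repeated stochastic rollouts. A minimal sketch of that aggregation, assuming the spread is a sample standard deviation (the summary does not state which statistic is used) and using made-up per-run accuracies:

```python
import statistics
from typing import List, Tuple


def summarize_runs(accuracies: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    mean = statistics.mean(accuracies)
    # stdev needs at least two points; report 0.0 for a single run.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


def format_pm(accuracies: List[float], digits: int = 1) -> str:
    """Format the summary in the 'mean ± std%' style used in the figures above."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}%"


# Hypothetical accuracies from three repeated evaluation runs.
print(format_pm([60.0, 62.0, 64.0]))  # 62.0 ± 2.0%
```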

Results

Bamboogle auto-eval (PaLM 2-L pre-trained)

Value: 70.3 ± 3.5%

Bamboogle auto-eval (PaLM 2-L, 2nd gen)

Value: 76.1 ± 1.3%

Baseline: PaLM 2-L pre-trained 70.3 ± 3.5%

Bamboogle auto-eval (PaLM 2-XS, 2nd gen)

Value: 65.9 ± 2.6%

Baseline: Pilot, human filtered 44.7 ± 3.1%

Human-eval accuracy (Bamboogle)

Value: Pre-trained L: 68.8%; 2nd-gen XS: 67.2%; 2nd-gen S: 68%; 2nd-gen L: 74.4%

Auto-eval vs human alignment

Value: Pearson r=0.98; Spearman r=0.83
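For readers who want to run the same alignment check on their own evaluations, the two correlation coefficients can be computed from paired auto-eval and human scores as below. This is a dependency-free sketch; the score lists are hypothetical and are not the paper's data.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model accuracies (percent) from the two evaluation methods.
auto_eval = [45.0, 55.0, 64.0, 70.0, 76.0]
human_eval = [47.0, 54.0, 66.0, 69.0, 74.0]
print(pearson(auto_eval, human_eval), spearman(auto_eval, human_eval))
```

Pearson measures how linearly the two score sets track each other, while Spearman only looks at ordering, which is why the two reported values can differ.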

Who Should Care

What To Try In 7 Days

Use a strong, few-shot-prompted LLM to generate agent trajectories on a small set of hard questions.

Use an instruction-tuned LLM to re-rank sampled trajectory steps and filter the best traces.

Fine-tune a small model on the synthetic trajectories and compare with LLM auto-eval before human checks.

Agent Features

Memory

  • short-term trajectory state stored in PAST ACTIONS

Planning

  • decision loop for search vs answer
  • multi-step thought-action-observation rounds

Tool Use

  • web-search calls via Google Q&A API
  • snippet summarization and link selection
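The memory, planning, and tool-use features above fit together roughly as in the loop below. It is a minimal sketch under stated assumptions: `decide`, `search`, and `answer` are hypothetical callables (in a real agent, `decide` and `answer` would be LLM calls and `search` would hit a search API), and snippet summarization and the self-checks are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AgentState:
    """Short-term trajectory state; past_actions plays the role of PAST ACTIONS."""
    question: str
    past_actions: List[Dict[str, str]] = field(default_factory=list)


def run_agent(
    question: str,
    decide: Callable[[AgentState], Optional[str]],  # search query, or None to answer
    search: Callable[[str], str],                   # query -> observation text
    answer: Callable[[AgentState], str],            # final answer from gathered state
    max_steps: int = 5,
) -> Tuple[str, AgentState]:
    """Alternate decision and search until the agent chooses to answer."""
    state = AgentState(question)
    for _ in range(max_steps):
        query = decide(state)
        if query is None:        # the agent decides it has enough information
            break
        observation = search(query)
        state.past_actions.append({"query": query, "observation": observation})
    return answer(state), state


# Toy stand-ins: search once in a tiny lookup table, then answer.
toy_index = {"capital of France": "Paris"}
final, trace = run_agent(
    "What is the capital of France?",
    decide=lambda s: None if s.past_actions else "capital of France",
    search=lambda q: toy_index.get(q, ""),
    answer=lambda s: s.past_actions[-1]["observation"] if s.past_actions else "unknown",
)
print(final)  # Paris
```

The `max_steps` cap is a design choice of this sketch to keep a confused policy from searching forever; the returned state doubles as the trajectory that a self-training loop would later rank and train on.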

Frameworks

  • Reflexion
  • ReST (adapted)
  • RAFT-style ranking

Is Agentic

true

Architectures

  • ReAct-style multi-step agent
  • code-as-prompt formatting

Optimization Features

Token Efficiency

  • LoRA

Model Optimization

  • self-distillation into PaLM 2-XS/S

System Optimization

  • auto-eval to reduce human evaluation effort

Training Optimization

  • iterative self-training (grow/improve loop)
  • LLM-based ranking instead of human reward model

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small handcrafted evaluation sets (125 and 100 questions) limit generalization.
  • Single search tool (internal Google Q&A API) used; other tools not tested.
  • Manual few-shot prompts and code-as-prompt design increase engineering burden.
  • Auto-eval depends on a strong reference LLM (PaLM 2-L), which is costly to run.

When Not To Use

  • When you require formal correctness guarantees or legal-grade verification.
  • If you lack access to a strong prompted teacher LLM for trajectory generation.
  • For multi-tool agent setups without further validation; only single-tool tested here.

Failure Modes

  • Hallucinated or poorly grounded summaries if search snippets are noisy.
  • Auto-eval bias: risk of overfitting to the auto-evaluator's preferences.
  • Propagation of low-quality actions across trajectories when re-ranking is imperfect.

Core Entities

Models

  • PaLM 2-L
  • PaLM 2-S
  • PaLM 2-XS

Metrics

  • Accuracy
  • Pearson correlation (auto-eval vs human)
  • Spearman correlation (auto-eval vs human)

Datasets

  • Bamboogle
  • BamTwoogle
  • HotpotQA
  • ELI5
  • ELI5-askH
  • ELI5-askS

Benchmarks

  • Bamboogle
  • BamTwoogle