Train a search-based LLM agent to self-improve via iterative synthetic trajectories and distill it into much smaller models.

December 15, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent gains on small handcrafted benchmarks and strong auto-eval alignment, but experiments rely on a single search tool, a few model sizes, and small test sets, limiting generality.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar

Why It Matters For Business

You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.

Summary TLDR

The paper builds a ReAct-style search agent that reasons and calls a web-search tool, then applies a ReST-like iterative self-training loop to grow and refine synthetic multi-step trajectories. The loop relies on LLM-based ranking and AI feedback rather than human labels. After two iterations, small fine-tuned models (PaLM 2-XS/S) recover much of the teacher's performance on compositional QA benchmarks (Bamboogle/BamTwoogle). LLM auto-eval closely tracks human judgments (Pearson correlation 0.98), which lets the authors cheaply run many stochastic agent rollouts for evaluation and selection.
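
As a rough illustration of the auto-eval vs human agreement check, the sketch below computes a Pearson correlation between two lists of scores. The score lists are made-up placeholders, not numbers from the paper, and the `pearson` helper is a name chosen here; the summary does not describe how the authors computed their figure.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies for four model variants (NOT from the paper).
human_scores = [0.40, 0.55, 0.60, 0.70]
auto_scores = [0.42, 0.52, 0.63, 0.71]

r = pearson(human_scores, auto_scores)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 means the auto-evaluator orders and spaces the variants much like human raters do, which is the property that makes it usable as a cheap proxy.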

Problem Statement

Agent workflows that interleave reasoning and tool calls are hard to improve with standard end-to-end training because interactions with external tools are non-differentiable and human-labeled multi-step trajectories are expensive and scarce. The paper asks: can an agent bootstrap its own training data and improve via AI feedback alone?

Main Contribution

A ReAct-style search agent that formats prompts as code, preserves trajectory state, and uses self-checks (relevance and grounding).

An adaptation of Reinforced Self-Training (ReST) for agentic multi-step setups: generate trajectories, re-rank with an instruction-tuned LLM, fine-tune, and repeat.
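
The generate, re-rank, fine-tune, repeat cycle can be sketched as a short loop. This is a simplified outline under stated assumptions, not the authors' implementation: ranking is done per whole trajectory, and `pilot_policy`, `toy_score`, and `toy_fine_tune` are hypothetical stand-ins for the prompted agent, the instruction-tuned LLM ranker, and real fine-tuning.

```python
import itertools
from typing import Callable, List, Tuple

Trajectory = List[str]                # one agent run as a list of step strings
Policy = Callable[[str], Trajectory]  # question -> trajectory


def self_improve(
    policy: Policy,
    questions: List[str],
    score: Callable[[str, Trajectory], float],
    fine_tune: Callable[[List[Tuple[str, Trajectory]]], Policy],
    iterations: int = 2,
    samples_per_question: int = 4,
) -> Policy:
    """Run a grow/improve loop and return the final policy."""
    for _ in range(iterations):
        dataset: List[Tuple[str, Trajectory]] = []
        for question in questions:
            # Grow: sample several trajectories from the current policy.
            candidates = [policy(question) for _ in range(samples_per_question)]
            # Improve: keep the candidate the ranker scores highest.
            best = max(candidates, key=lambda t: score(question, t))
            dataset.append((question, best))
        # Train the next-generation policy on the selected trajectories.
        policy = fine_tune(dataset)
    return policy


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---
_answers = itertools.cycle(["B", "A"])


def pilot_policy(question: str) -> Trajectory:
    # Stand-in for a prompted LLM agent: alternates between two answers.
    return ["Thought: guess", f"Answer: {next(_answers)}"]


def toy_score(question: str, trajectory: Trajectory) -> float:
    # Stand-in for LLM-based ranking: prefers trajectories that answer "A".
    return 1.0 if trajectory[-1] == "Answer: A" else 0.0


def toy_fine_tune(dataset: List[Tuple[str, Trajectory]]) -> Policy:
    # Stand-in for fine-tuning: memorize the selected trajectory per question.
    table = dict(dataset)
    return lambda question: table[question]


final_policy = self_improve(pilot_policy, ["q1", "q2"], toy_score, toy_fine_tune)
print(final_policy("q1"))
```

The default of two iterations mirrors the two generations reported in the summary; the sample count per question is an arbitrary choice for the sketch.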

Key Findings

Self-improvement raises small-model auto-eval accuracy substantially.

Numbers: PaLM 2-XS 44.7 ± 3.1% -> 65.9 ± 2.6% (pilot to 2nd gen)

Practical Use: Fine-tuning small models on synthetic agent trajectories can boost accuracy by ~21 percentage points on the evaluated compositional QA sets; use iterative self-training when human trajectory labels are unavailable.

Evidence Ref: Table 1; Table 3

Distilled small models can approach large-model quality on these benchmarks.

Numbers: Human eval, pre-trained PaLM 2-L 68.8% vs 2nd-gen XS 67.2% (Bamboogle)

Practical Use: You can distill a prompted large teacher into models one to two orders of magnitude smaller and retain near-teacher performance on similar search-based QA tasks, lowering inference cost.

Evidence Ref: Table 2 (human eval)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Bamboogle auto-eval (PaLM 2-L pre-trained) | 70.3 ± 3.5% | - | - | Bamboogle (auto-eval) | Table 1 pre-trained L | Table 1
Bamboogle auto-eval (PaLM 2-L, 2nd gen) | 76.1 ± 1.3% | PaLM 2-L pre-trained 70.3 ± 3.5% | +5.8 pts | Bamboogle (auto-eval) | Table 1 2nd gen L | Table 1
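
The deltas quoted on this page are absolute differences in percentage points, not relative improvements. The short check below reproduces them from the reported means; the helper names are chosen here for illustration, and the ± spreads are ignored.

```python
def delta_points(new_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new_pct - baseline_pct, 1)


def relative_gain_pct(new_pct: float, baseline_pct: float) -> float:
    """Improvement relative to the baseline, in percent of the baseline."""
    return round(100.0 * (new_pct - baseline_pct) / baseline_pct, 1)


# PaLM 2-L on Bamboogle auto-eval: pre-trained -> 2nd gen.
print(delta_points(76.1, 70.3))  # 5.8
# PaLM 2-XS: pilot -> 2nd gen.
print(delta_points(65.9, 44.7))  # 21.2

# The same XS change expressed relative to its baseline is a larger number,
# so the two kinds of figure should not be mixed.
print(relative_gain_pct(65.9, 44.7))
```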

What To Try In 7 Days

Use a strong prompted LLM to produce agent trajectories on a small set of hard questions.

Use an instruction-tuned LLM to re-rank sampled trajectory steps and filter the best traces.

Fine-tune a small model on the synthetic trajectories and compare with LLM auto-eval before human checks.
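
The re-rank and filter steps above can be prototyped before any LLM is wired in. In this minimal sketch the scorer is a word-overlap heuristic standing in for an instruction-tuned LLM ranker; `rerank_step`, `keep_top_traces`, and `overlap_scorer` are hypothetical names, not part of the paper.

```python
from typing import Callable, List, Sequence


def rerank_step(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate for a single agent step."""
    if not candidates:
        raise ValueError("no candidates to rank")
    return max(candidates, key=score)


def keep_top_traces(
    traces: List[List[str]],
    trace_score: Callable[[List[str]], float],
    top_k: int,
) -> List[List[str]]:
    """Keep the top_k complete traces by score, best first."""
    return sorted(traces, key=trace_score, reverse=True)[:top_k]


def overlap_scorer(question: str) -> Callable[[str], float]:
    """Toy scorer: count words shared with the question (stand-in for an LLM)."""
    question_words = set(question.lower().split())

    def score(text: str) -> float:
        return float(len(question_words & set(text.lower().split())))

    return score


question = "who founded the company that makes the pixel phone"
candidates = [
    "Search: weather today",
    "Search: pixel phone company",
    "Search: who founded the company",
]
best_step = rerank_step(candidates, overlap_scorer(question))
print(best_step)
```

Swapping `overlap_scorer` for a call to a real ranking model leaves the rest of the pipeline unchanged, which makes it easy to compare rankers on the same sampled candidates.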

Agent Features

Memory
short-term trajectory state stored in PAST ACTIONS
Planning
decision loop for search vs answer; multi-step thought-action-observation rounds
Tool Use
web-search calls via Google Q&A API; snippet summarization and link selection
Frameworks
Reflexion, ReST (adapted), RAFT-style ranking
Is Agentic

Yes

Architectures
ReAct-style multi-step agent; code-as-prompt formatting
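
To make the features above concrete, here is a minimal sketch of a search-vs-answer decision loop that keeps its short-term state in a list of past actions and renders that state as code-like text. The prompt layout, `render_prompt`, `run_agent`, and the toy `decide` and `search` functions are all assumptions made for illustration; the actual prompt format and self-checks (relevance, grounding) are not reproduced here.

```python
from typing import Callable, List, Optional, Tuple

PastActions = List[Tuple[str, str]]  # (search query, observation) pairs


def render_prompt(question: str, past_actions: PastActions) -> str:
    """Format the agent state as code-like text (hypothetical layout)."""
    lines = [f"QUESTION = {question!r}", "PAST_ACTIONS = ["]
    for query, observation in past_actions:
        lines.append(f"    {{'search': {query!r}, 'observation': {observation!r}}},")
    lines.append("]")
    return "\n".join(lines)


def run_agent(
    question: str,
    decide: Callable[[str, PastActions], Tuple[str, str]],
    search: Callable[[str], str],
    max_steps: int = 5,
) -> Tuple[Optional[str], PastActions]:
    """Alternate between deciding and searching until an answer is produced."""
    past_actions: PastActions = []
    for _ in range(max_steps):
        action, payload = decide(question, past_actions)
        if action == "answer":
            return payload, past_actions
        # Any other action is treated as a search; store the observation.
        past_actions.append((payload, search(payload)))
    return None, past_actions  # step budget exhausted without an answer


# --- Toy stand-ins for the LLM decision step and the search tool ---
TOY_INDEX = {"capital of France?": "Paris"}


def toy_search(query: str) -> str:
    return TOY_INDEX.get(query, "no results")


def toy_decide(question: str, past_actions: PastActions) -> Tuple[str, str]:
    if not past_actions:
        return "search", question          # nothing known yet: search first
    return "answer", past_actions[-1][1]   # answer from the last observation


answer, state = run_agent("capital of France?", toy_decide, toy_search)
print(answer)
print(render_prompt("capital of France?", state))
```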

Optimization Features

Token Efficiency
LoRA
Model Optimization
self-distillation into PaLM 2-XS/S
System Optimization
auto-eval to reduce human evaluation effort
Training Optimization
iterative self-training (grow/improve loop); LLM-based ranking instead of human reward model
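
Because agent rollouts are stochastic, a cheap auto-evaluator makes it practical to repeat each evaluation and report a spread. The sketch below assumes the ± figures on this page are a mean and standard deviation over repeated runs, which the summary does not state explicitly; the verdict lists are invented placeholders and the use of the sample standard deviation is also an assumption.

```python
import statistics
from typing import Sequence, Tuple


def accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of answers the auto-evaluator marked correct in one rollout."""
    if not verdicts:
        raise ValueError("empty rollout")
    return sum(verdicts) / len(verdicts)


def summarize(rollouts: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy across rollouts."""
    accuracies = [accuracy(r) for r in rollouts]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def as_percent(mean: float, std: float) -> str:
    return f"{100 * mean:.1f} ± {100 * std:.1f}%"


# Hypothetical auto-eval verdicts for three rollouts over ten questions.
rollouts = [
    [True] * 6 + [False] * 4,
    [True] * 7 + [False] * 3,
    [True] * 8 + [False] * 2,
]

mean_acc, std_acc = summarize(rollouts)
print(as_percent(mean_acc, std_acc))
```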

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small handcrafted evaluation sets (Bamboogle with 125 questions, BamTwoogle with 100) limit how far the results generalize.

Single search tool (internal Google Q&A API) used; other tools not tested.

When Not To Use

When you require formal correctness guarantees or legal-grade verification.

If you lack access to a strong prompted teacher LLM for trajectory generation.

Failure Modes

Hallucinated or poorly grounded summaries if search snippets are noisy.

Auto-eval bias: risk of overfitting to the auto-evaluator's preferences.

Core Entities

Models

PaLM 2-L, PaLM 2-S, PaLM 2-XS

Metrics

Accuracy, Pearson correlation (auto-eval vs human), Spearman correlation (auto-eval vs human)

Datasets

Bamboogle, BamTwoogle, HotpotQA, ELI5, ELI5-askH, ELI5-askS

Benchmarks

Bamboogle, BamTwoogle