Overview
The paper shows consistent gains on small handcrafted benchmarks and strong agreement between LLM auto-eval and human judgments, but the experiments rely on a single search tool, a few model sizes, and small test sets, which limits generality.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.
Who Should Care
Summary TLDR
The paper builds a ReAct-style search agent that reasons and calls a web-search tool, then applies a ReST-like iterative self-training loop using LLM-based ranking and AI feedback (no human labels) to grow and refine synthetic multi-step trajectories. After two iterations, small fine-tuned models (PaLM 2-XS/S) recover much of the teacher's performance on compositional QA benchmarks (Bamboogle/BamTwoogle). LLM auto-eval strongly matches human judgments (Pearson 0.98), letting the authors cheaply run many stochastic agent rollouts for evaluation and selection.
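The agreement figure quoted above is a Pearson correlation between auto-eval and human scores. A minimal sketch of that kind of check follows. The score lists are made-up placeholders, not numbers from the paper, and the `pearson` helper is written out by hand for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical accuracies (as fractions) for four imaginary model checkpoints.
human_scores = [0.40, 0.55, 0.62, 0.71]
auto_scores = [0.43, 0.56, 0.66, 0.72]
r = pearson(human_scores, auto_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 only says the two scorers rank and space the checkpoints similarly. It does not rule out a shared systematic bias, so a spot check against human labels remains useful.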
Problem Statement
Agent workflows that interleave reasoning and tool calls are hard to improve with standard end-to-end training because interactions with external tools are non-differentiable and human-labeled multi-step trajectories are expensive and scarce. The paper asks: can an agent bootstrap its own training data and improve via AI feedback alone?
Main Contribution
A ReAct-style search agent that formats prompts as code, preserves trajectory state, and uses self-checks (relevance and grounding).
An adaptation of Reinforced Self-Training (ReST) for agentic multi-step setups: generate trajectories, re-rank with an instruction-tuned LLM, fine-tune, and repeat.
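The generate, re-rank, fine-tune, repeat cycle can be sketched as a plain loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `generate`, `score`, and `fine_tune` are hypothetical callables standing in for the agent rollout, the LLM ranker, and the training job, and the sketch ranks whole trajectories for simplicity.

```python
from typing import Callable, List, Tuple

# Simplified stand-in: (question, trajectory text).
Trajectory = Tuple[str, str]

def self_improve(
    questions: List[str],
    model: str,
    generate: Callable[[str, str], List[Trajectory]],  # hypothetical rollout function
    score: Callable[[Trajectory], float],              # hypothetical AI-feedback ranker
    fine_tune: Callable[[str, List[Trajectory]], str], # hypothetical training job
    iterations: int = 2,
    keep_per_question: int = 1,
) -> str:
    """Run a ReST-like loop: sample, rank, keep the best, fine-tune, repeat."""
    for _ in range(iterations):
        dataset: List[Trajectory] = []
        for question in questions:
            candidates = generate(model, question)
            ranked = sorted(candidates, key=score, reverse=True)
            dataset.extend(ranked[:keep_per_question])
        # The fine-tuned model generates the data for the next iteration.
        model = fine_tune(model, dataset)
    return model

# Toy stubs so the loop can run end to end without any real model.
training_log: List[List[Trajectory]] = []

def fake_generate(model: str, question: str) -> List[Trajectory]:
    return [(question, f"{model}:draft{i}") for i in range(3)]

def fake_score(trajectory: Trajectory) -> float:
    return float(trajectory[1][-1])  # the trailing digit acts as a quality score

def fake_fine_tune(model: str, data: List[Trajectory]) -> str:
    training_log.append(data)
    return model + "+"

final_model = self_improve(["q1", "q2"], "m", fake_generate, fake_score, fake_fine_tune)
```

Passing the three stages in as callables keeps the loop independent of any particular model API, which makes it easy to swap the ranker or the number of iterations.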
Key Findings
Self-improvement raises small-model auto-eval accuracy substantially.
Distilled small models can approach large-model quality on these benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Bamboogle auto-eval (PaLM 2-L pre-trained) | 70.3 ± 3.5% | — | — | Bamboogle (auto-eval) | Table 1 pre-trained L | Table 1 |
| Bamboogle auto-eval (PaLM 2-L, 2nd gen) | 76.1 ± 1.3% | PaLM 2-L pre-trained 70.3 ± 3.5% | +5.8 pts | Bamboogle (auto-eval) | Table 1 2nd gen L | Table 1 |
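
The Delta column is a difference in percentage points. The short sketch below reproduces that arithmetic and also checks whether the two reported ± ranges intersect. It treats each ± value as a symmetric spread around the mean, which is an assumption, since the table does not say what the ± denotes. The function names are illustrative.

```python
def delta_points(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)

def spreads_overlap(mean_a: float, spread_a: float,
                    mean_b: float, spread_b: float) -> bool:
    """True if [mean - spread, mean + spread] for a and b intersect."""
    return (mean_a - spread_a) <= (mean_b + spread_b) and \
           (mean_b - spread_b) <= (mean_a + spread_a)

# Values taken from the two table rows above.
delta = delta_points(76.1, 70.3)
overlap = spreads_overlap(70.3, 3.5, 76.1, 1.3)
print(delta, overlap)
```

Non-overlapping ranges suggest the gain is larger than the reported run-to-run spread, but this check alone does not establish statistical significance.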
What To Try In 7 Days
Use a strong prompted LLM to generate agent trajectories on a small set of hard questions.
Use an instruction-tuned LLM to re-rank sampled trajectory steps and filter the best traces.
Fine-tune a small model on the synthetic trajectories and compare with LLM auto-eval before human checks.
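The re-ranking and filtering steps above can be prototyped before any model is wired in. The sketch below picks the highest-scoring candidate at each step and keeps only traces that end in a final answer. All names (`sample_steps`, `score_step`, `is_final`, and the demo stubs) are hypothetical, and the "FINAL:" prefix is an assumed convention, not one taken from the paper.

```python
from typing import Callable, List

def build_trace(
    sample_steps: Callable[[List[str]], List[str]],  # hypothetical: candidates given the trace so far
    score_step: Callable[[str], float],              # hypothetical: LLM ranker score for one step
    is_final: Callable[[str], bool],
    max_steps: int = 5,
) -> List[str]:
    """Build one trajectory by keeping the best-scored candidate at each step."""
    trace: List[str] = []
    for _ in range(max_steps):
        candidates = sample_steps(trace)
        if not candidates:
            break
        best = max(candidates, key=score_step)
        trace.append(best)
        if is_final(best):
            break
    return trace

def filter_traces(traces: List[List[str]], is_final: Callable[[str], bool]) -> List[List[str]]:
    """Keep only non-empty traces whose last step is a final answer."""
    return [t for t in traces if t and is_final(t[-1])]

# Toy stubs: two search steps, then a final answer.
def demo_sample(trace: List[str]) -> List[str]:
    if len(trace) < 2:
        return [f"search {len(trace)}", f"search query {len(trace)} refined"]
    return ["FINAL: answer", "search again"]

def demo_score(step: str) -> float:
    return 100.0 if step.startswith("FINAL") else float(len(step))

def demo_is_final(step: str) -> bool:
    return step.startswith("FINAL")

trace = build_trace(demo_sample, demo_score, demo_is_final)
kept = filter_traces([trace, ["search 0"], []], demo_is_final)
```

In a real setup `score_step` would call the instruction-tuned ranker, so the cost per trace grows with the number of candidates sampled at each step.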
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Risks & Boundaries
Limitations
Small handcrafted evaluation sets (125 and 100 questions) limit generalization.
Single search tool (internal Google Q&A API) used; other tools not tested.
When Not To Use
When you require formal correctness guarantees or legal-grade verification.
If you lack access to a strong prompted teacher LLM for trajectory generation.
Failure Modes
Hallucinated or poorly grounded summaries if search snippets are noisy.
Auto-eval bias: risk of overfitting to the auto-evaluator's preferences.

