Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
You can build and improve multi-step question-answering agents cheaply, without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most of the teacher's performance on similar tasks.
Summary TLDR
The paper builds a ReAct-style search agent that reasons and calls a web-search tool, then applies a ReST-like iterative self-training loop that uses LLM-based ranking and AI feedback (no human labels) to grow and refine synthetic multi-step trajectories. After two iterations, small fine-tuned models (PaLM 2-XS and PaLM 2-S) recover much of the teacher's performance on the compositional QA benchmarks Bamboogle and BamTwoogle. LLM auto-eval correlates strongly with human judgments (Pearson 0.98), which lets the authors cheaply run many stochastic agent rollouts for evaluation and selection.
Problem Statement
Agent workflows that interleave reasoning and tool calls are hard to improve with standard end-to-end training because interactions with external tools are non-differentiable and human-labeled multi-step trajectories are expensive and scarce. The paper asks: can an agent bootstrap its own training data and improve via AI feedback alone?
Main Contribution
A ReAct-style search agent that formats prompts as code, preserves trajectory state, and uses self-checks (relevance and grounding).
An adaptation of Reinforced Self-Training (ReST) for agentic multi-step setups: generate trajectories, re-rank with an instruction-tuned LLM, fine-tune, and repeat.
Empirical evidence that two iterations of self-improvement plus distillation produce small models that approach large-model performance on compositional QA benchmarks.
Demonstration that LLM-based auto-eval aligns tightly with human judgments, enabling cheap, low-variance evaluation of stochastic agent rollouts.
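The generate, re-rank, fine-tune, repeat cycle described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `generate`, `score`, and `fine_tune` are hypothetical callables standing in for trajectory sampling, LLM-based ranking, and model fine-tuning, and the toy stand-ins below replace the model with a single number so the loop can run without any LLM.

```python
import random


def self_improve(policy, questions, generate, score, fine_tune,
                 iterations=2, samples_per_question=4):
    """Simplified grow/improve loop; every callable is a hypothetical stand-in."""
    for _ in range(iterations):
        dataset = []
        for q in questions:
            # Grow: sample several candidate trajectories from the current policy.
            candidates = [generate(policy, q) for _ in range(samples_per_question)]
            # Improve: keep the candidate that the ranker scores highest.
            best = max(candidates, key=lambda t: score(q, t))
            dataset.append((q, best))
        # Fine-tune on the selected trajectories, then repeat with the new policy.
        policy = fine_tune(policy, dataset)
    return policy


# Toy stand-ins (assumptions for illustration only): the "policy" is a number,
# a "trajectory" is that number plus noise, and the ranker prefers larger values.
rng = random.Random(0)


def toy_generate(policy, question):
    return policy + rng.random()


def toy_score(question, trajectory):
    return trajectory


def toy_fine_tune(policy, dataset):
    return sum(t for _, t in dataset) / len(dataset)


initial_policy = 0.0
final_policy = self_improve(initial_policy, ["q1", "q2", "q3"],
                            toy_generate, toy_score, toy_fine_tune)
```

In this toy setting the policy value can only increase, because each round keeps the best of several samples. With a real model the quality of the ranker decides whether the selected data actually improves the next iteration.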
Key Findings
Self-improvement raises small-model auto-eval accuracy substantially.
Distilled small models can approach large-model quality on these benchmarks.
LLM-based auto-eval matches human judgments closely.
Data quality matters more than raw size.
Self-critique provides a small positive boost.
Results
Bamboogle auto-eval (PaLM 2-L pre-trained)
Bamboogle auto-eval (PaLM 2-L, 2nd gen)
Bamboogle auto-eval (PaLM 2-XS, 2nd gen)
Accuracy
Auto-eval vs human alignment
Who Should Care
What To Try In 7 Days
Use few-shot prompts with a strong LLM to produce agent trajectories on a small set of hard questions.
Use an instruction-tuned LLM to re-rank sampled trajectory steps and filter the best traces.
Fine-tune a small model on the synthetic trajectories and compare with LLM auto-eval before human checks.
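Because agent rollouts are stochastic, a single run gives a noisy accuracy estimate. One way to compare models with auto-eval is to repeat the rollout several times and report the mean and spread. The sketch below assumes the auto-evaluator has already produced one boolean judgment per question per rollout; the judgments shown are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def rollout_accuracy(judgments):
    """Fraction of questions judged correct in one rollout."""
    return sum(judgments) / len(judgments)


def summarize(rollouts):
    """Mean and sample standard deviation of accuracy across rollouts."""
    accuracies = [rollout_accuracy(r) for r in rollouts]
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread


# Placeholder judgments for two rollouts over the same four questions.
example_rollouts = [
    [True, True, False, False],
    [True, True, True, False],
]
mean_accuracy, accuracy_spread = summarize(example_rollouts)
```

Running the same summary for the teacher and the fine-tuned small model gives a like-for-like comparison before any human checks are scheduled.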
Agent Features
Memory
- short-term trajectory state stored in PAST ACTIONS
Planning
- decision loop for search vs answer
- multi-step thought-action-observation rounds
Tool Use
- web-search calls via Google Q&A API
- snippet summarization and link selection
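The memory, planning, and tool-use features above combine into a loop that might look like the sketch below. It is an assumption-laden outline, not the paper's code: `decide`, `search`, and `answer` are hypothetical callables standing in for the LLM's search-versus-answer decision, the search tool, and final answer generation, and `past_actions` mirrors the PAST ACTIONS trajectory state.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AgentState:
    question: str
    # Short-term trajectory state: (query, observation) pairs gathered so far.
    past_actions: List[Tuple[str, str]] = field(default_factory=list)


def run_agent(question, decide, search, answer, max_steps=5):
    """Repeat decide -> search -> record until the agent chooses to answer."""
    state = AgentState(question)
    for _ in range(max_steps):
        query = decide(state)  # a search query, or None to stop and answer
        if query is None:
            break
        observation = search(query)
        state.past_actions.append((query, observation))
    return answer(state), state


# Toy stand-ins (assumptions for illustration only).
def toy_decide(state):
    if len(state.past_actions) < 2:
        return f"lookup {len(state.past_actions) + 1}"
    return None


def toy_search(query):
    return f"snippet for '{query}'"


def toy_answer(state):
    return f"answer based on {len(state.past_actions)} searches"


final_answer, final_state = run_agent("example question",
                                      toy_decide, toy_search, toy_answer)
```

The `max_steps` cap is a design choice added here so the loop always terminates even if the decision step never chooses to answer.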
Frameworks
- Reflexion
- ReST (adapted)
- RAFT-style ranking
Is Agentic
true
Architectures
- ReAct-style multi-step agent
- code-as-prompt formatting
Optimization Features
Token Efficiency
- LoRA
Model Optimization
- self-distillation into PaLM 2-XS/S
System Optimization
- auto-eval to reduce human evaluation effort
Training Optimization
- iterative self-training (grow/improve loop)
- LLM-based ranking instead of human reward model
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small handcrafted evaluation sets (125 and 100 questions) limit generalization.
- Single search tool (internal Google Q&A API) used; other tools not tested.
- Manual few-shot prompts and code-as-prompt design increase engineering burden.
- Auto-eval depends on a strong reference LLM (PaLM 2-L), which is costly to run.
When Not To Use
- When you require formal correctness guarantees or legal-grade verification.
- If you lack access to a strong prompted teacher LLM for trajectory generation.
- For multi-tool agent setups without further validation; only a single tool was tested here.
Failure Modes
- Hallucinated or poorly grounded summaries if search snippets are noisy.
- Auto-eval bias: risk of overfitting to the auto-evaluator's preferences.
- Propagation of low-quality actions across trajectories when re-ranking is imperfect.
Core Entities
Models
- PaLM 2-L
- PaLM 2-S
- PaLM 2-XS
Metrics
- Accuracy
- Pearson correlation (auto-eval vs human)
- Spearman correlation (auto-eval vs human)
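For reference, the two correlation metrics can be computed as in the sketch below, which uses only the standard library. Pearson measures linear agreement between auto-eval and human scores, while Spearman is Pearson applied to ranks. The helper names are illustrative and the scores in the usage lines are made up, not data from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-model scores from an auto-evaluator and from human raters.
auto_scores = [0.2, 0.5, 0.6, 0.9]
human_scores = [0.1, 0.4, 0.7, 0.8]
r_pearson = pearson(auto_scores, human_scores)
r_spearman = spearman(auto_scores, human_scores)
```

Both functions assume each input has some variation; a constant sequence would cause a division by zero.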
Datasets
- Bamboogle
- BamTwoogle
- HotpotQA
- ELI5
- ELI5-askH
- ELI5-askS
Benchmarks
- Bamboogle
- BamTwoogle

