Train a search-based LLM agent to self-improve via iterative synthetic trajectories and distill it into much smaller models.

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent gains on small handcrafted benchmarks and strong auto-eval alignment, but experiments rely on a single search tool, a few model sizes, and small test sets, limiting generality.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar

Links

Abstract / PDF

Why It Matters For Business

You can cheaply build and improve multi-step question-answering agents without large human-labeled trajectory datasets, and then deploy much smaller, cheaper models that preserve most teacher performance on similar tasks.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

The paper builds a ReAct-style search agent that reasons and calls a web-search tool, then applies a ReST-like iterative self-training loop using LLM-based ranking and AI feedback (no human labels) to grow and refine synthetic multi-step trajectories. After two iterations, small fine-tuned models (PaLM 2-XS/S) recover much of the teacher's performance on compositional QA benchmarks (Bamboogle/BamTwoogle). LLM auto-eval strongly matches human judgments (Pearson 0.98), letting the authors cheaply run many stochastic agent rollouts for evaluation and selection.
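As a rough illustration of how agreement between auto-eval and human judgment can be checked, the sketch below computes a Pearson correlation between accuracy scores from an LLM rater and from human raters. The scores are made-up placeholders, not figures from the paper, and the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical accuracies (%) for four model checkpoints; NOT from the paper.
auto_eval_scores = [45.0, 58.0, 66.0, 74.0]
human_scores = [43.0, 60.0, 65.0, 72.0]

r = pearson(auto_eval_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would suggest the auto-evaluator ranks systems much as humans do, though it says nothing about agreement on individual answers.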

Problem Statement

Agent workflows that interleave reasoning and tool calls are hard to improve with standard end-to-end training because interactions with external tools are non-differentiable and human-labeled multi-step trajectories are expensive and scarce. The paper asks: can an agent bootstrap its own training data and improve via AI feedback alone?

Main Contribution

A ReAct-style search agent that formats prompts as code, preserves trajectory state, and uses self-checks (relevance and grounding).

An adaptation of Reinforced Self-Training (ReST) for agentic multi-step setups: generate trajectories, re-rank with an instruction-tuned LLM, fine-tune, and repeat.
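The generate, re-rank, fine-tune, repeat cycle can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the "model" is a function that emits a trajectory with a noisy quality score, the ranker reads that score instead of calling an instruction-tuned LLM, and "fine-tuning" just moves the toy skill toward the selected data. It also ranks whole trajectories, a simplification of the ranking the summary describes. All names are hypothetical.

```python
import random

def sample_trajectories(model, questions, samples_per_question, rng):
    """Grow step: sample several candidate trajectories per question."""
    return {
        q: [model(q, rng) for _ in range(samples_per_question)]
        for q in questions
    }

def select_best(candidates, ranker):
    """Improve step, part 1: keep the top-ranked trajectory per question."""
    return {q: max(trajs, key=ranker) for q, trajs in candidates.items()}

def self_improve(model, fine_tune, ranker, questions, iterations=2,
                 samples_per_question=4, seed=0):
    """Alternate grow and improve steps for a fixed number of iterations."""
    rng = random.Random(seed)
    for _ in range(iterations):
        candidates = sample_trajectories(model, questions,
                                         samples_per_question, rng)
        best = select_best(candidates, ranker)
        model = fine_tune(model, best)  # improve step, part 2
    return model

# --- Toy stand-ins (assumptions for illustration only) ---

def make_toy_model(skill):
    """Toy 'model': emits a trajectory whose quality is noisy around `skill`."""
    def model(question, rng):
        return {"question": question,
                "quality": skill + rng.uniform(-0.2, 0.2)}
    model.skill = skill
    return model

def toy_ranker(trajectory):
    """Stand-in for LLM-based ranking: reads the toy quality score."""
    return trajectory["quality"]

def toy_fine_tune(model, best):
    """Stand-in for fine-tuning: new skill = mean quality of selected data."""
    mean_quality = sum(t["quality"] for t in best.values()) / len(best)
    return make_toy_model(mean_quality)

final_model = self_improve(make_toy_model(0.5), toy_fine_tune, toy_ranker,
                           questions=["q1", "q2", "q3"])
print(f"toy skill after 2 iterations: {final_model.skill:.2f}")
```

Because selection keeps only the best of several samples, the data used for each round tends to be better than the model's average output, which is the intuition behind this kind of loop.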

Key Findings

Self-improvement raises small-model auto-eval accuracy substantially.

Numbers: PaLM 2-XS, 44.7±3.1% -> 65.9±2.6% (pilot to 2nd gen)

Practical Use: Fine-tuning small models on synthetic agent trajectories can boost accuracy by ~21 percentage points on evaluated compositional QA sets; use iterative self-training when human trajectory labels are unavailable.

Evidence Ref: Table 1; Table 3

Distilled small models can approach large-model quality on these benchmarks.

Numbers: Human eval, pre-trained PaLM 2-L 68.8% vs 2nd-gen XS 67.2% (Bamboogle)

Practical Use: You can distill a prompted large teacher into models one to two orders of magnitude smaller and retain near-teacher performance on similar search-based QA tasks, lowering inference cost.

Evidence Ref: Table 2 (human eval)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Bamboogle auto-eval (PaLM 2-L pre-trained)	70.3 ± 3.5%	—	—	Bamboogle (auto-eval)	Table 1 pre-trained L	Table 1
Bamboogle auto-eval (PaLM 2-L, 2nd gen)	76.1 ± 1.3%	PaLM 2-L pre-trained 70.3 ± 3.5%	+5.8 pts	Bamboogle (auto-eval)	Table 1 2nd gen L	Table 1

What To Try In 7 Days

Use a strong prompted LLM to generate agent trajectories on a small set of hard questions.

Use an instruction-tuned LLM to re-rank sampled trajectory steps and filter the best traces.

Fine-tune a small model on the synthetic trajectories and compare with LLM auto-eval before human checks.
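For the comparison step, one way to report results in the "mean ± std" form used in this summary is to run the stochastic agent several times and aggregate. The sketch below assumes a hypothetical `agent(question, rng)` callable and uses exact match as a stand-in for an LLM auto-evaluator; neither is taken from the paper.

```python
import random
import statistics

def accuracy(predictions, references, is_correct):
    """Percentage of predictions the judge accepts."""
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def evaluate_runs(agent, questions, references, is_correct, runs=5, seed=0):
    """Run the agent `runs` times (runs >= 2); return mean and sample std."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        predictions = [agent(q, rng) for q in questions]
        scores.append(accuracy(predictions, references, is_correct))
    return statistics.mean(scores), statistics.stdev(scores)

def exact_match(prediction, reference):
    """Stand-in judge; a real setup would prompt an LLM to grade the answer."""
    return prediction == reference

def noisy_agent(question, rng):
    """Toy agent: returns the right answer about 70% of the time."""
    truth = question.upper()
    return truth if rng.random() < 0.7 else "unknown"

questions = ["a", "b", "c", "d", "e", "f", "g", "h"]
references = [q.upper() for q in questions]

mean_acc, std_acc = evaluate_runs(noisy_agent, questions, references, exact_match)
print(f"auto-eval accuracy: {mean_acc:.1f} ± {std_acc:.1f}%")
```

Reporting the spread alongside the mean helps judge whether a gap between two models is larger than run-to-run noise before spending human-eval effort on it.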

Agent Features

Memory

short-term trajectory state stored in PAST ACTIONS

Planning

decision loop for search vs answer

multi-step thought-action-observation rounds

Tool Use

web-search calls via Google Q&A API

snippet summarization and link selection
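The memory, planning, and tool-use features above can be pictured as a small loop. The sketch below is a minimal illustration, assuming a canned search function and a hard-coded policy in place of an LLM and a real search API; it leaves out snippet summarization and the relevance and grounding self-checks. Every name in it is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    past_actions: list = field(default_factory=list)  # short-term trajectory memory

def fake_search(query):
    """Stand-in for a web-search tool; returns a canned snippet."""
    corpus = {
        "capital of France": "Paris is the capital of France.",
    }
    return corpus.get(query, "no results")

def fake_policy(state):
    """Stand-in for the LLM decision step: search once, then answer."""
    if not state.past_actions:
        return ("search", "capital of France")
    last_observation = state.past_actions[-1][2]
    return ("answer", last_observation)

def run_agent(question, policy=fake_policy, search=fake_search, max_steps=5):
    """Loop: decide (search vs answer), act, observe, remember."""
    state = AgentState(question)
    for _ in range(max_steps):
        action, argument = policy(state)
        if action == "answer":
            return argument, state.past_actions
        observation = search(argument)
        # Record (action, argument, observation) so later steps can see it.
        state.past_actions.append((action, argument, observation))
    return None, state.past_actions  # step budget exhausted without an answer

answer, trace = run_agent("What is the capital of France?")
print(answer)
print(trace)
```

Keeping the full list of past actions in the state is what lets each decision condition on earlier searches, and the step budget guards against a policy that never chooses to answer.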

Frameworks

Reflexion, ReST (adapted), RAFT-style ranking

Is Agentic

Yes

Architectures

ReAct-style multi-step agent

code-as-prompt formatting

Optimization Features

Token Efficiency

LoRA

Model Optimization

self-distillation into PaLM 2-XS/S

System Optimization

auto-eval to reduce human evaluation effort

Training Optimization

iterative self-training (grow/improve loop)

LLM-based ranking instead of human reward model

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Small handcrafted evaluation sets (125 and 100 questions) limit generalization.

Single search tool (internal Google Q&A API) used; other tools not tested.

When Not To Use

When you require formal correctness guarantees or legal-grade verification.

If you lack access to a strong prompted teacher LLM for trajectory generation.

Failure Modes

Hallucinated or poorly grounded summaries if search snippets are noisy.

Auto-eval bias: risk of overfitting to the auto-evaluator's preferences.

Core Entities

Models

PaLM 2-L, PaLM 2-S, PaLM 2-XS

Metrics

Accuracy

Pearson correlation (auto-eval vs human)

Spearman correlation (auto-eval vs human)

Datasets

Bamboogle, BamTwoogle, HotpotQA, ELI5, ELI5-askH, ELI5-askS

Benchmarks

Bamboogle, BamTwoogle
