Aviary: train small open LLM agents to solve multi-step biology tasks and match frontier models at far lower inference cost

December 30, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The framework and experiments demonstrate practical gains on measurable benchmarks and real tool chains; results are compelling but depend on environment fidelity, curated demos, and careful split design.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can train modest open LLMs to match or beat larger closed models and humans on multi-step scientific workflows while cutting inference cost by orders of magnitude, enabling cheaper high-throughput automation.

Summary TLDR

The authors introduce Aviary, an open-source gym for training language agents on multi-step scientific tasks, and formalize this class of tasks as language decision processes (LDPs). They ship five environments (GSM8K, HOTPOTQA, PaperQA/LitQA2, SeqQA/DNA cloning, protein stability) and show that a small open model (Llama-3.1-8B-Instruct), trained with behavior cloning and expert iteration and run with inference-time majority voting, can match or exceed a frontier model (Claude 3.5 Sonnet) and human experts on SeqQA and LitQA2 at up to ~100x lower inference cost. Code and environments are released.

Problem Statement

Scientific tasks need repeated cycles of observation, tool calls, and reasoning. There is no easy, reusable framework to build, train, and evaluate language agents that use tools and multi-step plans. Can modest open LLMs be trained to reliably solve such scientific workflows at practical cost?

Main Contribution

Formalize language agent tasks as language decision processes (LDPs) and cast agents as stochastic computation graphs.

Release Aviary, a gym with five environments focused on multi-step and scientific tasks (including SeqQA, LitQA2, protein stability).
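To make the LDP framing concrete, here is a minimal sketch of the observe, act, tool-call loop it describes. Everything in it is hypothetical and for illustration only: `ToyCalculatorEnv`, `scripted_agent`, and `rollout` are not the Aviary or ldp APIs, and the scripted policy stands in for an LLM.

```python
from dataclasses import dataclass

# Hypothetical toy types for illustration; NOT the Aviary or ldp APIs.

@dataclass
class ToyCalculatorEnv:
    """One-question environment exposing two tools: add and submit."""
    a: int
    b: int

    def reset(self) -> str:
        # The first observation states the task and the available tools.
        return f"What is {self.a} + {self.b}? Tools: add(x, y), submit(answer)."

    def step(self, action: dict) -> tuple[str, float, bool]:
        """Apply a tool call and return (observation, reward, done)."""
        if action["tool"] == "add":
            x, y = action["args"]
            return str(x + y), 0.0, False
        if action["tool"] == "submit":
            reward = 1.0 if action["args"][0] == self.a + self.b else 0.0
            return "episode finished", reward, True
        return f"unknown tool {action['tool']!r}", 0.0, False


def scripted_agent(observation: str, state: list[str]) -> tuple[dict, list[str]]:
    """Stand-in for an LLM policy: (observation, agent state) -> (action, new state)."""
    state = state + [observation]  # agent state is the message history
    if len(state) == 1:
        # First turn: pull the two operands out of the question and call the tool.
        nums = [int(t.strip("?")) for t in observation.split() if t.strip("?").isdigit()]
        return {"tool": "add", "args": nums[:2]}, state
    # Second turn: the observation is the tool result, so submit it.
    return {"tool": "submit", "args": [int(observation)]}, state


def rollout(env, agent, max_steps: int = 5):
    """Run one episode and return (trajectory, total reward)."""
    obs, state, total, trajectory = env.reset(), [], 0.0, []
    for _ in range(max_steps):
        action, state = agent(obs, state)
        obs, reward, done = env.step(action)
        trajectory.append((action, obs, reward))
        total += reward
        if done:
            break
    return trajectory, total


trajectory, total_reward = rollout(ToyCalculatorEnv(2, 3), scripted_agent)
```

The trajectory returned by `rollout` is the unit that training methods such as behavior cloning would consume: a sequence of (action, observation, reward) steps ending in a scored submission.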

Key Findings

A trained Llama-3.1-8B-Instruct agent reached 0.89 test accuracy on SeqQA using large-sample majority voting.

Numbers: 0.89 accuracy (SeqQA, test; many-sample consensus)

Practical Use: Train an 8B open model with demonstrations and sample many rollouts to reach top SeqQA performance instead of relying on larger closed models.

Evidence Ref: Figure 6A and Sec. 5.2
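Majority voting over many rollouts can be sketched as follows. This is an illustrative simulation, not the paper's evaluation code: `noisy_solver`, the four answer options, and the 0.6 per-rollout accuracy are assumptions chosen only to show how consensus accuracy grows with the number of samples k.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common non-None answer, or None if every rollout failed."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def noisy_solver(correct, options, p_correct, rng):
    """Hypothetical single rollout: right with probability p_correct, else a random wrong option."""
    if rng.random() < p_correct:
        return correct
    return rng.choice([o for o in options if o != correct])


def consensus_accuracy(k, n_tasks=200, p_correct=0.6, seed=0):
    """Fraction of simulated tasks where the majority answer over k rollouts is correct."""
    rng = random.Random(seed)
    options = ["A", "B", "C", "D"]
    hits = 0
    for _ in range(n_tasks):
        correct = rng.choice(options)
        votes = [noisy_solver(correct, options, p_correct, rng) for _ in range(k)]
        hits += majority_vote(votes) == correct
    return hits / n_tasks


for k in (1, 5, 15):
    print(f"consensus@{k}: {consensus_accuracy(k):.2f}")
```

The gain depends on errors being spread across different wrong answers. If rollouts tend to agree on the same wrong answer, extra samples help much less.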

The same SeqQA accuracy can be reached at roughly 100x lower inference cost using the trained 8B agent vs Claude 3.5 Sonnet.

Numbers: ≈100x cheaper; example costs $0.021 vs ~$2.2 per task

Practical Use: If you need high-throughput automation, invest in training a cheaper open model and scale inference sampling to cut operating cost dramatically.

Evidence Ref: Figure 7 and Sec. 5.3
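The per-task figures quoted above can be checked with a few lines of arithmetic. The two costs come from this finding; the 10,000-task campaign size is a hypothetical workload added only to show the scale of the difference.

```python
# Per-task inference costs quoted in the finding above (USD).
small_cost_per_task = 0.021    # trained Llama-3.1-8B-Instruct agent
frontier_cost_per_task = 2.2   # Claude 3.5 Sonnet agent (approximate)

# Cost ratio: about 105x, consistent with the "roughly 100x" claim.
ratio = frontier_cost_per_task / small_cost_per_task


def campaign_cost(cost_per_task: float, n_tasks: int) -> float:
    """Total inference cost for a batch of tasks at a fixed per-task cost."""
    return cost_per_task * n_tasks


n_tasks = 10_000  # hypothetical workload, not from the source
print(f"ratio: {ratio:.0f}x")
print(f"8B agent: ${campaign_cost(small_cost_per_task, n_tasks):,.0f}")
print(f"frontier: ${campaign_cost(frontier_cost_per_task, n_tasks):,.0f}")
```

This comparison covers inference only. It leaves out the one-time cost of collecting demonstrations and fine-tuning the small model, which has to be amortized over the workload.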

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.89 (Llama-3.1-8B-Instruct, many-sample majority voting) | previous best reported 0.87 (claude-3.5-sonnet report) | +0.02 | SeqQA test | Figure 6A, Sec. 5.2 | Fig. 6A
Accuracy | ≈0.88 (Llama-3.1-8B-Instruct with consensus sampling) | claude-3-5-sonnet agent single-rollout ~0.89 (plateau reported) | Llama matches/exceeds humans at modest sampling | LitQA2 test | Figure 6C and Sec. 5.2 | Fig. 6C

What To Try In 7 Days

Clone aviary and ldp repos and run a provided environment (SeqQA or PaperQA) locally.

Collect or generate a small set of high-quality trajectories from a stronger model and SFT the 8B model (behavior cloning).

Run one round of expert iteration to collect better rollouts and fine-tune further on the expanded buffer.
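The behavior cloning and expert iteration steps above can be sketched with a toy tabular policy. This is a sketch under stated assumptions, not the authors' training code: `ToyPolicy`, `fine_tune`, the three-action task set, and `reward_fn` are hypothetical stand-ins for an LLM, an SFT run, and an environment reward.

```python
import random
from collections import defaultdict

ACTIONS = ["A", "B", "C"]


class ToyPolicy:
    """Tabular stand-in for an LLM: per-task action counts act as the 'weights'."""

    def __init__(self):
        self.counts = defaultdict(lambda: {a: 1.0 for a in ACTIONS})

    def sample(self, task, rng):
        weights = [self.counts[task][a] for a in ACTIONS]
        return rng.choices(ACTIONS, weights=weights)[0]

    def prob(self, task, action):
        total = sum(self.counts[task].values())
        return self.counts[task][action] / total

    def fine_tune(self, trajectories):
        """'SFT' here simply upweights each demonstrated (task, action) pair."""
        for task, action in trajectories:
            self.counts[task][action] += 1.0


def expert_iteration(policy, tasks, reward_fn, buffer, rounds, rollouts_per_task, rng):
    """Sample rollouts, keep only rewarded ones, and fine-tune on the growing buffer."""
    for _ in range(rounds):
        for task in tasks:
            for _ in range(rollouts_per_task):
                action = policy.sample(task, rng)
                if reward_fn(task, action) > 0:  # rejection sampling: keep successes only
                    buffer.append((task, action))
        policy.fine_tune(buffer)
    return buffer


# Hypothetical tasks, each with one correct action.
answers = {"t1": "A", "t2": "C", "t3": "B"}
reward_fn = lambda task, action: 1.0 if answers[task] == action else 0.0

rng = random.Random(0)
policy = ToyPolicy()

# Step 1, behavior cloning: fine-tune on demonstrations from a stronger model.
demos = [(task, correct) for task, correct in answers.items()]
policy.fine_tune(demos)

# Step 2, expert iteration: the policy's own successful rollouts expand the buffer.
buffer = expert_iteration(policy, list(answers), reward_fn, list(demos),
                          rounds=1, rollouts_per_task=8, rng=rng)
```

In a real setup the filter would be the environment's reward on a full multi-step trajectory, and `fine_tune` would be a supervised fine-tuning run over the buffered trajectories.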

Agent Features

Memory
agent state includes conversation/memory (short-term); retrieval-augmented context (local search indexes)
Planning
internal text reasoning (ReAct-style); trajectory-level planning via expert iteration
Tool Use
API tool calls (calculator, search, Rosetta, sequence tools); retrieval + reranking; simulation tools (assembly, PCR, Rosetta)
Frameworks
Aviary; LDP; paper-qa
Is Agentic

Yes

Architectures
stochastic computation graph; tool-calling agent; rejection-sampling / ensemble sampling

Optimization Features

Token Efficiency
cheap-many-rollouts viable with small model; trade compute for correctness via sampling
Infra Optimization
A100 GPUs for fine-tuning; use of low-cost inference endpoints for Llama-sized models
Model Optimization
SFT
System Optimization
batching agent states for efficient rollout; local search indexing (tantivy) for retrieval speed
Training Optimization
behavior cloning from high-quality trajectories; expert iteration with rejection sampling
Inference Optimization
majority voting / consensus@k; oracle pass@k for protein stability
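Oracle pass@k counts a task as solved if any of k sampled rollouts succeeds, so it is an upper bound on what a selection rule such as majority voting can achieve. The summary does not say how pass@k was computed, so the sketch below is an assumption: it uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n rollouts were sampled for a task and c of them were correct. The per-task counts are made-up example data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k rollouts drawn from n (c correct) succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect rollouts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, where each entry is (n sampled, c correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task outcomes: (rollouts sampled, rollouts correct).
results = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5, 10):
    print(f"pass@{k}: {mean_pass_at_k(results, k):.3f}")
```

Oracle selection assumes a verifier that can recognize a correct rollout, which is why it suits tasks with a measurable score and not open-ended question answering.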

Risks & Boundaries

Limitations

Environment software is complex with many dependencies and version sensitivity.

Some dataset splits and indexes are released as DOIs only, without the underlying text, which limits full data reproducibility.

When Not To Use

When you need provable correctness or full auditability for safety-critical lab work.

When you cannot produce good demonstrations for behavior cloning or lack the compute to fine-tune.

Failure Modes

Hallucinated tool calls or incorrect tool usage leading to wrong answers.

Overfitting to demonstration trajectories and reduced trajectory diversity.

Core Entities

Models

Llama-3.1-8B-Instruct; claude-3-5-sonnet-20241022; gpt-4o

Metrics

Accuracy; Rosetta ddG (protein stability)

Datasets

GSM8K; HOTPOTQA; LitQA2; SeqQA; Mega-scale protein stability dataset

Benchmarks

SeqQA; LitQA2; GSM8K; HOTPOTQA

Context Entities

Models

ThermoMPNN (cited); OmegaFold (cited)