Aviary: train small open LLM agents to solve multi-step biology tasks and match frontier models at far lower inference cost

Overview

Decision Snapshot: Needs Validation

The framework and experiments demonstrate practical gains on measurable benchmarks and real tool chains; results are compelling but depend on environment fidelity, curated demos, and careful split design.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can train modest open LLMs to match or beat larger closed models and human experts on multi-step scientific workflows while cutting inference cost by up to ~100x, enabling cheaper high-throughput automation.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

The authors introduce Aviary, an open-source gym for training language agents on multi-step scientific tasks and formalize the task class as a language decision process (LDP). They ship five environments (GSM8K, HOTPOTQA, PaperQA/LitQA2, SeqQA/DNA cloning, protein stability) and show that a small open model (Llama-3.1-8B-Instruct), after behavior cloning + expert iteration and using inference-time majority voting, can match or exceed a frontier model (Claude 3.5 Sonnet) and human experts on SeqQA and LitQA2 while using up to ~100x less inference cost. Code and environments are released.

Problem Statement

Scientific tasks need repeated cycles of observation, tool calls, and reasoning. There is no easy, reusable framework to build, train, and evaluate language agents that use tools and multi-step plans. Can modest open LLMs be trained to reliably solve such scientific workflows at practical cost?

Main Contribution

Formalize language agent tasks as language decision processes (LDPs) and cast agents as stochastic computation graphs.

Release Aviary, a gym with five environments focused on multi-step and scientific tasks (including SeqQA, LitQA2, protein stability).
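To make the language decision process framing concrete, the following is a minimal sketch of one agent-environment loop: the agent reads a text observation, emits a tool call, and the environment returns a new observation, a reward, and a done flag. Every name here (`ToolCall`, `ToyCalculatorEnv`, `scripted_agent`, `rollout`) is hypothetical and chosen for illustration; this is not the Aviary or ldp API, and the scripted agent stands in for an LLM policy.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical tool request emitted by the agent."""
    name: str
    args: dict


@dataclass
class ToyCalculatorEnv:
    """Toy environment: the agent must submit the sum of two numbers."""
    a: int = 17
    b: int = 25
    done: bool = field(default=False, init=False)

    def reset(self) -> str:
        """Start an episode and return the first text observation."""
        self.done = False
        return f"What is {self.a} + {self.b}? Tools: add(x, y), submit(answer)."

    def step(self, call: ToolCall) -> tuple[str, float, bool]:
        """Apply one tool call; return (observation, reward, done)."""
        if call.name == "add":
            return str(call.args["x"] + call.args["y"]), 0.0, False
        if call.name == "submit":
            self.done = True
            correct = call.args["answer"] == self.a + self.b
            return "episode over", (1.0 if correct else 0.0), True
        return f"unknown tool: {call.name}", 0.0, False


def scripted_agent(observation: str) -> ToolCall:
    """Stand-in for an LLM policy: call add once, then submit its result."""
    if observation.isdigit():
        # The previous tool call returned a number, so submit it.
        return ToolCall("submit", {"answer": int(observation)})
    x, y = (int(n) for n in re.findall(r"\d+", observation)[:2])
    return ToolCall("add", {"x": x, "y": y})


def rollout(env: ToyCalculatorEnv, agent, max_steps: int = 5) -> float:
    """Run one episode and return the total reward."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        observation, reward, done = env.step(agent(observation))
        total += reward
        if done:
            break
    return total
```

In a real setting the scripted policy would be replaced by a model that chooses tool calls stochastically, which is what makes sampling many rollouts per task meaningful.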

Key Findings

A trained Llama-3.1-8B-Instruct agent reached 0.89 test accuracy on SeqQA using large-sample majority voting.

Numbers: 0.89 accuracy (SeqQA, test; many-sample consensus)

Practical Use: Train an 8B open model with demonstrations and sample many rollouts to reach top SeqQA performance instead of relying on larger closed models.

Evidence Ref: Figure 6A and Sec. 5.2
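Majority voting over sampled rollouts can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: `sample_answer` is a hypothetical callable that runs one agent rollout and returns its final answer (or `None` if the rollout produced no answer), and abstentions are simply dropped before counting votes.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional


def majority_vote(answers: Iterable[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most common non-None answer, or None if every rollout abstained."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return votes.most_common(1)[0][0]


def consensus_at_k(sample_answer: Callable[[], Optional[Hashable]], k: int) -> Optional[Hashable]:
    """Sample k independent rollouts and return their majority answer."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return majority_vote(sample_answer() for _ in range(k))
```

The design trade-off is that accuracy is bought with extra rollouts, so the approach is attractive mainly when each rollout is cheap.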

The same SeqQA accuracy can be reached at roughly 100x lower inference cost using the trained 8B agent vs Claude 3.5 Sonnet.

Numbers: ≈100x cheaper; example costs $0.021 vs ~$2.2 per task

Practical Use: If you need high-throughput automation, invest in training a cheaper open model and scale inference sampling to cut operating cost dramatically.

Evidence Ref: Figure 7 and Sec. 5.3
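The per-task figures above can be turned into a quick back-of-the-envelope check. The two costs come from the finding itself; the workload size passed to `projected_savings` is a hypothetical input, and the helper names are illustrative only.

```python
SMALL_MODEL_COST_PER_TASK = 0.021  # USD, trained 8B agent (from the finding above)
FRONTIER_COST_PER_TASK = 2.2       # USD, approximate Claude 3.5 Sonnet figure


def cost_ratio(frontier_cost: float, small_cost: float) -> float:
    """How many times more expensive the frontier model is per task."""
    if small_cost <= 0:
        raise ValueError("small_cost must be positive")
    return frontier_cost / small_cost


def projected_savings(n_tasks: int, frontier_cost: float, small_cost: float) -> float:
    """Total USD saved over n_tasks by using the cheaper agent."""
    return n_tasks * (frontier_cost - small_cost)


ratio = cost_ratio(FRONTIER_COST_PER_TASK, SMALL_MODEL_COST_PER_TASK)
print(f"Frontier model costs about {ratio:.0f}x more per task")
```

With the quoted numbers the ratio works out to roughly 105x, consistent with the "≈100x" headline. The calculation covers inference only and ignores the one-time cost of fine-tuning.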

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.89 (Llama-3.1-8B-Instruct, many-sample majority voting)	previous best reported 0.87 (claude-3.5-sonnet report)	+0.02	SeqQA test	Figure 6A, Sec.5.2	Fig.6A
Accuracy	≈0.88 (Llama-3.1-8B-Instruct with consensus sampling)	claude-3-5-sonnet agent single-rollout ~0.89 (plateau reported)	Llama matches/exceeds humans at modest sampling	LitQA2 test	Figure 6C and Sec.5.2	Fig.6C

What To Try In 7 Days

Clone aviary and ldp repos and run a provided environment (SeqQA or PaperQA) locally.

Collect or generate a small set of high-quality trajectories from a stronger model and SFT the 8B model (behavior cloning).

Run one round of expert iteration to collect better rollouts and fine-tune further on the expanded buffer.
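The behavior cloning and expert iteration steps above can be sketched as a single loop: sample rollouts with the current policy, keep only those that clear a reward bar (rejection sampling), add them to the training buffer, and fine-tune on the buffer. This is a schematic under stated assumptions, not the authors' training code. `sample_rollout` and `fine_tune` are hypothetical callables you would supply, and a trajectory is assumed to be a dict with a `"reward"` key.

```python
from typing import Any, Callable, Optional, Sequence

Trajectory = dict  # assumed shape: {"task": ..., "reward": float, ...}


def expert_iteration(
    policy: Any,
    tasks: Sequence[Any],
    sample_rollout: Callable[[Any, Any], Trajectory],
    fine_tune: Callable[[Any, list], Any],
    rounds: int = 1,
    rollouts_per_task: int = 4,
    min_reward: float = 1.0,
    seed_buffer: Optional[list] = None,
) -> tuple[Any, list]:
    """Alternate between collecting filtered rollouts and fine-tuning on them.

    seed_buffer holds the initial demonstrations (for example, trajectories
    from a stronger model that were used for behavior cloning).
    """
    buffer = list(seed_buffer or [])
    for _ in range(rounds):
        for task in tasks:
            for _ in range(rollouts_per_task):
                trajectory = sample_rollout(policy, task)
                # Rejection sampling: discard rollouts below the reward bar.
                if trajectory["reward"] >= min_reward:
                    buffer.append(trajectory)
        # Train on demonstrations plus every accepted rollout so far.
        policy = fine_tune(policy, buffer)
    return policy, buffer
```

Keeping the seed demonstrations in the buffer means each round trains on both the original expert data and the policy's own successful rollouts.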

Agent Features

Memory

agent state includes conversation/memory (short-term)

retrieval-augmented context (local search indexes)

Planning

internal text reasoning (ReAct-style)

trajectory-level planning via expert iteration

Tool Use

API tool calls (calculator, search, Rosetta, sequence tools)

retrieval + reranking

simulation tools (assembly, PCR, Rosetta)

Frameworks

Aviary

LDP

paper-qa

Is Agentic

Yes

Architectures

stochastic computation graph

tool-calling agent

rejection-sampling / ensemble sampling

Optimization Features

Token Efficiency

many cheap rollouts are viable with a small model

trade compute for correctness via sampling

Infra Optimization

A100 GPUs for fine-tuning

use of low-cost inference endpoints for Llama-sized models

Model Optimization

SFT

System Optimization

batching agent states for efficient rollout

local search indexing (tantivy) for retrieval speed

Training Optimization

behavior cloning from high-quality trajectories

expert iteration with rejection sampling

Inference Optimization

majority voting / consensus@k

oracle pass@k for protein stability
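Oracle pass@k differs from consensus@k in how the k samples are scored. The sketch below assumes "oracle pass@k" means a task counts as solved if at least one of its k rollouts passes a ground-truth check; that reading, the function name, and the reward threshold are assumptions made for illustration rather than details taken from the paper.

```python
from typing import Sequence


def oracle_pass_at_k(
    rewards_per_task: Sequence[Sequence[float]],
    k: int,
    threshold: float = 1.0,
) -> float:
    """Fraction of tasks where any of the first k rollouts reaches the threshold."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not rewards_per_task:
        raise ValueError("need at least one task")
    solved = sum(
        any(reward >= threshold for reward in rewards[:k])
        for rewards in rewards_per_task
    )
    return solved / len(rewards_per_task)
```

Because an oracle picks the best sample, this metric is an upper bound on what any selection rule over the same k rollouts (including majority voting) could achieve.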

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Future-House/aviary

https://github.com/Future-House/ldp

https://github.com/paper-qa/paper-qa

Data URLs

https://huggingface.co/datasets/aviary-paper-data (DOI lists and splits described)

Risks & Boundaries

Limitations

Environment software is complex with many dependencies and version sensitivity.

Some dataset splits and indexes are not public text (DOIs only), limiting full data reproducibility.

When Not To Use

When you need provable correctness or full auditability for safety-critical lab work.

If you cannot produce good demonstrations for behavior cloning or lack compute to fine-tune.

Failure Modes

Hallucinated tool calls or incorrect tool usage leading to wrong answers.

Overfitting to demonstration trajectories and reduced trajectory diversity.

Core Entities

Models

Llama-3.1-8B-Instruct

claude-3-5-sonnet-20241022

gpt-4o

Metrics

Accuracy

Rosetta ddG (protein stability)

Datasets

GSM8K

HOTPOTQA

LitQA2

SeqQA

Mega-scale protein stability dataset

Benchmarks

SeqQA

LitQA2

GSM8K

HOTPOTQA

Context Entities

Models

ThermoMPNN (cited)

OmegaFold (cited)
