Automatic, pseudocode-based evaluation and a 100-protocol BIOPROT dataset to test LLM planning for lab protocols

Overview

Decision Snapshot: Needs Validation

The paper provides a usable dataset, clear metrics, and a lab validation. However, the models are closed-source, and retrieval and function reuse remain weak, so expect human oversight and engineering work before adopting this in production.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BIOPROT and the pseudocode evaluation let teams measure and improve LLM planning for lab protocols quickly, reducing expert labeling and enabling reproducible protocol generation for automation workflows.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

The authors build BIOPROT: 100 biology lab protocols translated into model-readable pseudocode. They propose evaluating LLMs by giving them admissible pseudofunctions (a closed action set) and scoring the generated pseudocode for function choice, ordering, and argument accuracy. GPT-4 outperforms GPT-3.5 on ordering and next-step prediction, and 59% of the GPT-4-generated pseudocode files needed no manual edits. The authors demonstrate real-world utility by generating two new protocols and successfully running one in a lab. The dataset, code, and prompts are public.
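
To make the "function choice" part of the scoring concrete, here is a minimal sketch of precision and recall over function names. The function names and the multiset matching rule are illustrative assumptions, not the paper's exact scoring code.

```python
from collections import Counter


def function_precision_recall(predicted, reference):
    """Multiset precision and recall over pseudofunction names."""
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Each name counts at most as often as it appears in both lists.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical function names, used only for illustration.
reference = ["add_reagent", "incubate", "centrifuge", "wash"]
predicted = ["add_reagent", "centrifuge", "incubate", "vortex"]

precision, recall = function_precision_recall(predicted, reference)
print(precision, recall)
```

Order is ignored here on purpose: ordering is scored separately, so a model that picks the right functions in the wrong sequence still gets credit for function choice.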

Problem Statement

Natural language is brittle for evaluating multi-step lab protocols: small differences matter, descriptions vary in detail, and manual expert review is slow. The paper proposes turning protocols into pseudocode and a closed action set so models can be evaluated automatically and robustly.

Main Contribution

A protocol-evaluation method that converts natural-language protocols into protocol-specific pseudofunctions and pseudocode.

BIOPROT: a manually reviewed dataset of 100 biology protocols with pseudocode and machine-generated summaries.
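
The closed action set can be pictured as a lookup of admissible pseudofunctions against which generated pseudocode is checked. The sketch below assumes Python-like pseudocode and uses hypothetical function and argument names; it is not taken from the BIOPROT files.

```python
import ast

# Hypothetical admissible pseudofunctions: name -> allowed keyword arguments.
ADMISSIBLE = {
    "add_reagent": {"reagent", "volume"},
    "incubate": {"temperature", "duration"},
    "centrifuge": {"speed", "duration"},
}


def check_pseudocode(source, admissible):
    """Return a list of problems found in Python-like pseudocode."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # Only plain calls such as incubate(...) are checked.
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            continue
        name = node.func.id
        if name not in admissible:
            problems.append(f"unknown function: {name}")
            continue
        for kw in node.keywords:
            if kw.arg not in admissible[name]:
                problems.append(f"{name}: unexpected argument {kw.arg}")
    return problems


generated = '''
add_reagent(reagent="lysis buffer", volume="200 uL")
incubate(temperature="37 C", duration="30 min")
vortex(duration="10 s")
'''
print(check_pseudocode(generated, ADMISSIBLE))
```

A check like this catches calls outside the action set before any metric is computed, which is one way to implement the "simple error-checking" step.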

Key Findings

BIOPROT contains 100 biology protocols translated into pseudocode.

Numbers: 100 protocols; avg steps 12.5; avg pseudofunctions per protocol 10.3

Practical Use: You can use this ready-made dataset to benchmark LLM planning on multi-step lab tasks without building a corpus from scratch.

Evidence Ref: Table 1 and Table 2

GPT-4-generated pseudocode required no manual edits for the majority of protocols.

Numbers: 59% of generated protocols required no edits; edited files averaged 11.8 line edits

Practical Use: A pipeline of LLM generation plus simple error-checking yields largely correct pseudocode, so experts only need to correct the remaining ~41% of files.

Evidence Ref: Table 3; Section 3.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Next-step prediction: accuracy (higher better)	GPT-4 70.6% ± 0.4	GPT-3.5 65.0% ± 1.3	≈ +5.6 pp	BIOPROT next-step task	Table 4 reports GPT-4 70.6% and GPT-3.5 65.0% on ordered functions	Table 4
Protocol generation: normalized Levenshtein distance (lower better)	GPT-4 0.396 ± 0.046 (no shuffle, no feedback)	GPT-3.5 0.498 ± 0.036	≈ −0.102	BIOPROT full protocol generation	Table 5 reports GPT-4 0.396 vs GPT-3.5 0.498 for function-ordering distance	Table 5
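
The ordering metric in the second row can be illustrated with a short sketch: Levenshtein distance computed over sequences of function names, then normalized. Dividing by the longer sequence length is an assumption made here for illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(predicted, reference):
    """Distance scaled to [0, 1]; 0 means identical sequences."""
    longest = max(len(predicted), len(reference))
    return levenshtein(predicted, reference) / longest if longest else 0.0


# Hypothetical function-name sequences with two steps swapped.
reference = ["add_reagent", "incubate", "centrifuge", "wash"]
predicted = ["add_reagent", "centrifuge", "incubate", "wash"]
print(normalized_levenshtein(predicted, reference))
```

Because the distance works on whole function names rather than characters, a swapped pair of steps costs two edits regardless of how long the names are.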

What To Try In 7 Days

Run the BIOPROT pipeline on your model to get function-level and ordering metrics.

Use the pseudofunction approach to convert 10 of your internal SOPs and spot-check LLM outputs.

Prototype a retrieval+assembly agent to auto-assemble protocol steps and have a scientist verify one simple experiment.

Agent Features

Memory

Retrieval memory (embedding index)

Planning

Planning with LLMs, Function-level planning (admissible action set)

Tool Use

Protocol search/retrieval tool, Embedding-based nearest-neighbour search
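
As a rough sketch of embedding-based nearest-neighbour search, the example below ranks protocols by cosine similarity. The three-dimensional vectors and protocol titles are toy values invented for illustration; a real pipeline would obtain vectors from an embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_protocols(query_vec, index, k=2):
    """Return the titles of the k protocols most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), title)
              for title, vec in index.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


# Toy index: hypothetical titles mapped to made-up embedding vectors.
INDEX = {
    "DNA extraction": [0.9, 0.1, 0.0],
    "Western blot": [0.1, 0.9, 0.2],
    "PCR amplification": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
print(nearest_protocols(query, INDEX))
```

A linear scan like this is fine for a 100-protocol corpus; a larger index would usually call for an approximate nearest-neighbour library.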

Frameworks

LangChain, Toolformer-like agent

Is Agentic

Yes

Architectures

LLM with tool access, Toolformer-like chain-of-thought agent

Collaboration

Human-in-the-loop verification

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/bioplanner/bioplanner

Data URLs

https://github.com/bioplanner/bioplanner

Risks & Boundaries

Limitations

Relies on closed-source GPT APIs (the authors spent about $1000 on API calls).

Dataset focused on biology; other domains may need custom functions.

When Not To Use

For fully automated lab execution without human scientist review.

When open-source-only toolchains are required (paper uses GPT-4/3.5).

Failure Modes

LLM omits or mislabels units and parameters, causing unusable steps.

Function-name mismatches (names that are semantically equivalent but syntactically different) are penalized in retrieval scoring.
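
The first failure mode, missing units, lends itself to a cheap automated check before human review. The sketch below flags quantity arguments that are not written as a number followed by a unit; the unit list and argument names are hypothetical and would need to be adapted to your lab's conventions.

```python
import re

# Hypothetical unit whitelist; extend it for your own protocols.
QUANTITY = re.compile(
    r"^\s*\d+(\.\d+)?\s*(uL|mL|L|mg|g|C|s|min|h|rpm|x g)\s*$"
)


def missing_units(quantity_args):
    """Return names of quantity arguments not written as '<number> <unit>'."""
    return [name for name, value in quantity_args.items()
            if not QUANTITY.match(str(value))]


# Only quantity-like arguments are passed in; free-text ones such as
# reagent names would be checked separately.
step = {"volume": "200 uL", "duration": "30", "speed": "300 x g"}
print(missing_units(step))
```

A check like this does not verify that a value is scientifically sensible, only that it is parseable, so it complements rather than replaces scientist review.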

Core Entities

Models

GPT-3.5, GPT-4, Llama2-7B

Metrics

Accuracy, function precision/recall, normalized Levenshtein distance, SciBERTScore, BLEU

Datasets

BIOPROT

Benchmarks

next-step prediction, protocol generation, function retrieval

Context Entities

Models

SciBERT, text-embedding-ada-002

Datasets

Protocols.io (source)

