Automatic, pseudocode-based evaluation and a 100-protocol BIOPROT dataset to test LLM planning for lab protocols

October 16, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides a usable dataset, clear metrics, and a lab validation. However, the models are closed-source, and retrieval and function reuse remain weak, so expect to need human oversight and engineering work before adopting this in production.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques

Why It Matters For Business

BIOPROT and the pseudocode evaluation let teams measure and improve LLM planning for lab protocols quickly, reducing expert labeling and enabling reproducible protocol generation for automation workflows.

Summary TLDR

The authors build BIOPROT, a set of 100 biology lab protocols translated into model-readable pseudocode. They propose evaluating LLMs by supplying a closed set of admissible pseudofunctions and scoring the generated pseudocode on function choice, ordering, and argument accuracy. GPT-4 outperforms GPT-3.5 on ordering and next-step prediction, and 59% of GPT-4-generated pseudocode needed no manual edits. They demonstrate real-world utility by generating two new protocols and successfully running one in a lab. The dataset, code, and prompts are public.

Problem Statement

Natural language is a brittle medium for evaluating multi-step lab protocols: small differences matter, descriptions vary in level of detail, and manual expert review is slow. The paper proposes converting each protocol into pseudocode written against a closed action set, so that model outputs can be evaluated automatically and robustly.

Main Contribution

A protocol-evaluation method that converts natural-language protocols into protocol-specific pseudofunctions and pseudocode.

BIOPROT: a manually reviewed dataset of 100 biology protocols with pseudocode and machine-generated summaries.
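To make the idea of protocol-specific pseudofunctions and pseudocode concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, arguments, and the three-step protocol are invented for illustration and are not taken from BIOPROT, and the assumption that pseudocode is expressed as Python-style calls is ours.

```python
# Hypothetical illustration only: the function names, arguments, and protocol
# steps below are invented for this sketch and are not taken from BIOPROT.

def add_reagent(sample: str, reagent: str, volume: str) -> str:
    """Add a reagent to a sample and return a label for the mixture."""
    return f"{sample}+{reagent}({volume})"

def incubate(sample: str, temperature: str, duration: str) -> str:
    """Incubate a sample and return a label for the incubated sample."""
    return f"{sample}@{temperature}/{duration}"

def centrifuge(sample: str, speed: str, duration: str) -> str:
    """Spin a sample and return a label for the resulting pellet."""
    return f"pellet[{sample}|{speed}/{duration}]"

# The closed action set: a model is only allowed to call these functions.
ADMISSIBLE = {f.__name__ for f in (add_reagent, incubate, centrifuge)}

def run_protocol() -> str:
    """Pseudocode for the protocol: an ordered sequence of function calls."""
    lysate = add_reagent("cells", reagent="lysis buffer", volume="200 uL")
    lysate = incubate(lysate, temperature="37 C", duration="10 min")
    return centrifuge(lysate, speed="12000 x g", duration="5 min")

print(sorted(ADMISSIBLE))
print(run_protocol())
```

Because the action set is closed, a generated protocol can be compared with the reference on three axes: which functions were called, in what order, and with which arguments.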

Key Findings

BIOPROT contains 100 biology protocols translated into pseudocode.

Numbers: 100 protocols; avg steps 12.5; avg pseudofunctions per protocol 10.3

Practical Use: You can use this ready dataset to benchmark LLM planning on multi-step lab tasks without building a corpus from scratch.

Evidence Ref: Table 1 and Table 2

GPT-4-generated pseudocode required no manual edits for the majority of protocols.

Numbers: 59% of generated protocols required no edits; edited files averaged 11.8 line edits

Practical Use: A pipeline of LLM generation plus simple error-checking yields largely correct pseudocode, which could cut expert labeling work by about half.

Evidence Ref: Table 3; Section 3.3
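One form that "simple error-checking" could take is a static check that generated pseudocode parses and only calls functions from the admissible set. The sketch below assumes the pseudocode uses Python syntax; `check_admissible` and the sample function names are hypothetical and are not the authors' tooling.

```python
import ast

def check_admissible(pseudocode: str, admissible: set[str]) -> list[str]:
    """Return problems found in generated pseudocode (empty list means it passed)."""
    try:
        tree = ast.parse(pseudocode)
    except SyntaxError as err:
        return [f"syntax error: {err.msg} (line {err.lineno})"]
    problems = []
    for node in ast.walk(tree):
        # Only plain calls such as foo(...) are checked; method calls are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in admissible:
                problems.append(f"line {node.lineno}: unknown function '{node.func.id}'")
    return problems

# Hypothetical model output that calls one function outside the action set.
generated = """
mix = add_reagent("cells", "lysis buffer", "200 uL")
mix = vortex(mix, "30 s")
"""
print(check_admissible(generated, {"add_reagent", "incubate"}))
```

Outputs that fail a check like this can be sent back to the model or routed to a human reviewer, leaving experts to look only at the files that need edits.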

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 70.6% ± 0.4 | GPT-3.5 65.0% ± 1.3 | ≈ +5.6 pp | BIOPROT next-step task | Table 4 reports GPT-4 70.6% and GPT-3.5 65.0% on ordered functions | Table 4
Protocol generation: normalized Levenshtein distance (lower better) | GPT-4 0.396 ± 0.046 (no shuffle, no feedback) | GPT-3.5 0.498 ± 0.036 | ≈ −0.102 | BIOPROT full protocol generation | Table 5 shows GPT-4 0.396 vs GPT-3.5 0.498 for ordering accuracy | Table 5
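For readers unfamiliar with the ordering metric, the following sketch computes a normalized Levenshtein distance over sequences of function names. Dividing by the length of the longer sequence is an assumption made here; the paper's exact normalization may differ, and the function names are hypothetical.

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]

def normalized_levenshtein(pred: list[str], ref: list[str]) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0

# Hypothetical example: the prediction swaps two adjacent steps.
reference = ["add_reagent", "incubate", "centrifuge", "wash"]
predicted = ["add_reagent", "centrifuge", "incubate", "wash"]
print(normalized_levenshtein(predicted, reference))  # 2 edits / 4 steps = 0.5
```

A score of 0 means the predicted call sequence matches the reference exactly, which is why lower values are better in the table above.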

What To Try In 7 Days

Run the BIOPROT pipeline on your model to get function-level and ordering metrics.

Use the pseudofunction approach to convert 10 of your internal SOPs and spot-check LLM outputs.

Prototype a retrieval+assembly agent to auto-assemble protocol steps and have a scientist verify one simple experiment.
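As a starting point for the first item, function-level metrics can be approximated with set-based precision and recall over function names. This is a simplified sketch: treating the functions as sets and the name `function_precision_recall` are assumptions, and the BIOPROT code may compute these metrics differently.

```python
def function_precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Set-based precision and recall over the function names that were called."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical function names, for illustration only.
reference = ["add_reagent", "incubate", "centrifuge", "wash"]
predicted = ["add_reagent", "vortex", "centrifuge"]
precision, recall = function_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")  # 2 of 3 predicted, 2 of 4 reference
```

Precision drops when the model calls functions the reference does not use, and recall drops when it omits required ones, so the two together show which kind of error dominates.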

Agent Features

Memory
Retrieval memory (embedding index)
Planning
Planning with LLMs; function-level planning (admissible action set)
Tool Use
Protocol search/retrieval tool; embedding-based nearest-neighbour search
Frameworks
LangChain; Toolformer-like agent
Is Agentic

Yes

Architectures
LLM with tool access; Toolformer-like chain-of-thought agent
Collaboration
Human-in-the-loop verification
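The embedding-based nearest-neighbour search listed under Tool Use can be sketched with plain cosine similarity. The three-dimensional vectors below are hand-made stand-ins for real embeddings (for example from text-embedding-ada-002), and the protocol titles and the `nearest` helper are hypothetical.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k protocol titles whose vectors are most similar to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]

# Toy index: titles and vectors are invented, not real embeddings.
INDEX = {
    "DNA extraction": [0.9, 0.1, 0.0],
    "Western blot": [0.1, 0.9, 0.2],
    "PCR amplification": [0.7, 0.2, 0.3],
}
query = [0.8, 0.1, 0.1]  # stand-in for the embedding of a user's request
print(nearest(query, INDEX, k=2))
```

In an agent, the retrieved protocols would supply candidate pseudofunctions for the model to reuse, with a scientist verifying the assembled result.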

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on closed-source GPT APIs (the authors spent about $1,000 on API calls).

Dataset focused on biology; other domains may need custom functions.

When Not To Use

For fully automated lab execution without human scientist review.

When open-source-only toolchains are required (paper uses GPT-4/3.5).

Failure Modes

LLM omits or mislabels units and parameters, causing unusable steps.

Function names that are semantically equivalent but spelled differently are scored as mismatches, which penalizes retrieval.
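The first failure mode lends itself to a lightweight lint that flags numeric arguments given without a unit. The sketch below is an assumption about how such a check could look: the unit list is deliberately small and hypothetical, and a real check would need a much fuller vocabulary.

```python
import re

# Hypothetical, deliberately small unit list for illustration.
UNITS = ("uL", "mL", "L", "mg", "g", "mM", "M", "C", "min", "s", "h", "rpm", "x g")
NUMBER = r"\d+(?:\.\d+)?"
UNIT_PATTERN = "|".join(re.escape(u) for u in UNITS)
QUANTITY = re.compile(NUMBER + r"\s*(?:" + UNIT_PATTERN + ")")
BARE_NUMBER = re.compile(NUMBER)

def check_quantity(value: str) -> str:
    """Classify an argument as 'ok', 'missing unit', or 'unrecognized'."""
    value = value.strip()
    if QUANTITY.fullmatch(value):
        return "ok"
    if BARE_NUMBER.fullmatch(value):
        return "missing unit"
    return "unrecognized"  # e.g. reagent names; left for a human to review

# Hypothetical arguments from one generated step.
args = {"volume": "200 uL", "temperature": "37", "reagent": "lysis buffer"}
report = {name: check_quantity(value) for name, value in args.items()}
print(report)
```

A check like this catches omitted units but not mislabeled ones (for example mL written where uL was meant), so it complements rather than replaces scientist review.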

Core Entities

Models

GPT-3.5, GPT-4, Llama2-7B

Metrics

Accuracy, function precision/recall, normalized Levenshtein distance, SciBERTScore, BLEU

Datasets

BIOPROT

Benchmarks

Next-step prediction, protocol generation, function retrieval

Context Entities

Models

SciBERT, text-embedding-ada-002

Datasets

Protocols.io (source)