Automatic, pseudocode-based evaluation and a 100-protocol BIOPROT dataset to test LLM planning for lab protocols

October 16, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

6

Authors

Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques

Why It Matters For Business

BIOPROT and the pseudocode evaluation let teams measure and improve LLM planning for lab protocols quickly, reducing expert labeling and enabling reproducible protocol generation for automation workflows.

Summary TLDR

The authors build BIOPROT: 100 biology lab protocols translated into model-readable pseudocode. They propose evaluating LLMs by supplying a closed set of admissible pseudofunctions and scoring the generated pseudocode on function choice, ordering, and argument accuracy. GPT-4 outperforms GPT-3.5 on ordering and next-step prediction, and 59% of the GPT-4-generated pseudocode needed no manual edits. To demonstrate real-world utility, they generate two new protocols and successfully run one in a lab. The dataset, code, and prompts are public.

Problem Statement

Natural language is brittle for evaluating multi-step lab protocols: small differences matter, descriptions vary in detail, and manual expert review is slow. The paper proposes turning protocols into pseudocode and a closed action set so models can be evaluated automatically and robustly.
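To make the idea of pseudofunctions and a closed action set concrete, here is a minimal sketch. Every function name, argument, and value below is a hypothetical illustration, not an entry from BIOPROT; the point is only that a protocol becomes an ordered list of calls drawn from a fixed set, which can then be checked mechanically.

```python
# Hypothetical pseudofunctions for a glycerol-stock style protocol.
# Names, arguments, and values are illustrative assumptions, not BIOPROT data.

def grow_culture(strain: str, medium: str, hours: float) -> str:
    return f"grow {strain} in {medium} for {hours} h"

def mix_with_glycerol(culture_ml: float, glycerol_percent: float) -> str:
    return f"mix {culture_ml} mL culture with glycerol to {glycerol_percent}% final"

def freeze(temperature_c: float) -> str:
    return f"store at {temperature_c} C"

# The closed action set: the only functions a model is allowed to call.
ADMISSIBLE = {f.__name__: f for f in (grow_culture, mix_with_glycerol, freeze)}

# A protocol expressed as pseudocode: ordered (function name, arguments) pairs.
PSEUDOCODE = [
    ("grow_culture", {"strain": "E. coli", "medium": "LB", "hours": 16}),
    ("mix_with_glycerol", {"culture_ml": 0.5, "glycerol_percent": 25}),
    ("freeze", {"temperature_c": -80}),
]

def run_pseudocode(pseudocode):
    """Render each step, rejecting any call outside the admissible set."""
    steps = []
    for name, kwargs in pseudocode:
        if name not in ADMISSIBLE:
            raise ValueError(f"'{name}' is not an admissible pseudofunction")
        steps.append(ADMISSIBLE[name](**kwargs))
    return steps

for step in run_pseudocode(PSEUDOCODE):
    print(step)
```

Because both the function names and their arguments are structured, a generated protocol can be compared with a reference step by step instead of by free-text similarity.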

Main Contribution

A protocol-evaluation method that converts natural-language protocols into protocol-specific pseudofunctions and pseudocode.

BIOPROT: a manually reviewed dataset of 100 biology protocols with pseudocode and machine-generated summaries.

A suite of tasks and metrics: next-step prediction, full protocol generation, function retrieval, and argument scoring.

Empirical evaluation of GPT-3.5, GPT-4, and Llama-2 on these tasks, including ablations (shuffled admissible functions, feedback).

A proof-of-concept agent that assembles protocols from retrieved pseudofunctions; it generated two new protocols, one of which was executed in a real lab.

Public release of dataset and code for reproducible benchmarking.

Key Findings

BIOPROT contains 100 biology protocols translated into pseudocode.

Numbers: 100 protocols; avg steps 12.5; avg pseudofunctions per protocol 10.3

GPT-4-generated pseudocode required no manual edits for the majority of protocols.

Numbers: 59% of generated protocols required no edits; edited files averaged 11.8 line edits

GPT-4 is better at ordering steps than GPT-3.5 on full protocol generation.

Numbers: normalized Levenshtein distance (lower is better) of 0.396 for GPT-4 vs 0.498 for GPT-3.5 (no shuffle, no feedback)

Step-level function prediction is good but sensitive to input order.

Numbers: next-step function accuracy for GPT-4 is 70.6% (ordered) vs 57.0% (shuffled)

Function retrieval from other protocols remains weak and ambiguous.

Numbers: GPT-4 retrieval precision / recall: nearest 32.5% / 39.2%; random 48.8% / 49.4%

Real-world validation: a GPT-4–assembled protocol ran successfully in a lab.

Numbers: one generated E. coli glycerol-stock protocol was executed and produced viable cells after -80 °C storage (Figure 3; Section 5)
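The ordering finding above relies on a normalized Levenshtein distance between the predicted and reference sequences of function names. The sketch below shows one way to compute it. Dividing by the longer sequence length is an assumption made here for illustration; the paper's exact normalization may differ, and the function names in the example are hypothetical.

```python
from typing import Sequence

def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance over sequences of function names (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_levenshtein(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Assumed normalization: divide by the longer sequence length."""
    denom = max(len(pred), len(ref))
    return levenshtein(pred, ref) / denom if denom else 0.0

# Hypothetical reference and prediction with the first two steps swapped.
reference = ["grow_culture", "mix_with_glycerol", "freeze"]
predicted = ["mix_with_glycerol", "grow_culture", "freeze"]
print(round(normalized_levenshtein(predicted, reference), 3))
```

A score of 0 means the predicted call order matches the reference exactly; larger values mean more insertions, deletions, or substitutions are needed to repair the order.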

Results

Next-step function accuracy

Value: GPT-4 70.6% ± 0.4

Baseline: GPT-3.5 65.0% ± 1.3

Protocol generation: normalized Levenshtein distance (lower is better)

Value: GPT-4 0.396 ± 0.046 (no shuffle, no feedback)

Baseline: GPT-3.5 0.498 ± 0.036

Function retrieval precision / recall (GPT-4, nearest neighbors)

Value: Precision 32.5%; Recall 39.2%

Baseline: GPT-3.5 Precision 24.2%; Recall 35.7%

Generated pseudocode quality (manual verification)

Value: 59% required no human edits

Baseline: 41% required ≥1 edit

Human benchmark on function selection

Value: Precision 87.5%; Recall 84% (n=20)

Baseline: GPT-4 scores lower on the same task
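The precision and recall figures above compare the set of functions a model selects with the reference set. A minimal sketch of that comparison follows, assuming exact string matching on function names; the names are hypothetical. Exact matching is also why semantically equivalent but differently named functions get penalized.

```python
from typing import Iterable, Tuple

def function_precision_recall(predicted: Iterable[str],
                              reference: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall over function names (exact match)."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical selection: two correct functions, two extras, one missed.
predicted = ["grow_culture", "freeze", "pellet_cells", "run_gel"]
reference = ["grow_culture", "mix_with_glycerol", "freeze"]
precision, recall = function_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```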

Who Should Care

What To Try In 7 Days

Run the BIOPROT pipeline on your model to get function-level and ordering metrics.

Use the pseudofunction approach to convert 10 of your internal SOPs and spot-check LLM outputs.

Prototype a retrieval+assembly agent to auto-assemble protocol steps and have a scientist verify one simple experiment.
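As a starting point for the retrieval half of such a prototype, the sketch below ranks a small protocol library by cosine similarity to a query and pools the pseudofunctions of the top matches. Bag-of-words vectors stand in for a real embedding model, and the library contents, titles, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-library: protocol title -> pseudofunction names.
LIBRARY = {
    "Preparing bacterial glycerol stocks":
        ["grow_culture", "mix_with_glycerol", "freeze"],
    "Plasmid miniprep from bacterial culture":
        ["grow_culture", "pellet_cells", "lyse_cells", "elute_dna"],
    "Western blot of cell lysate":
        ["run_gel", "transfer_to_membrane", "incubate_antibody"],
}

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)  # Counter returns 0 for missing words
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def retrieve_functions(query: str, k: int = 2) -> list:
    """Pool pseudofunctions from the k protocols most similar to the query."""
    q = embed(query)
    ranked = sorted(LIBRARY, key=lambda title: cosine(q, embed(title)), reverse=True)
    functions = []
    for title in ranked[:k]:
        for name in LIBRARY[title]:
            if name not in functions:  # keep first occurrence, preserve order
                functions.append(name)
    return functions

print(retrieve_functions("glycerol stocks of bacterial culture"))
```

The pooled list would then serve as the admissible set handed to the LLM for assembly, with a scientist reviewing the result before anything is run.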

Agent Features

Memory

  • Retrieval memory (embedding index)

Planning

  • Planning with LLMs
  • Function-level planning (admissible action set)

Tool Use

  • Protocol search/retrieval tool
  • Embedding-based nearest-neighbour search

Frameworks

  • LangChain
  • Toolformer-like agent

Is Agentic

true

Architectures

  • LLM with tool access
  • Toolformer-like chain-of-thought agent

Collaboration

  • Human-in-the-loop verification

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on closed-source GPT APIs (authors spent ~$1,000 on calls).
  • Dataset focused on biology; other domains may need custom functions.
  • Function retrieval is ambiguous without canonical naming and hurts assembly.
  • Automatic evaluation depends on the teacher model's generated pseudocode quality and manual corrections.

When Not To Use

  • For fully automated lab execution without human scientist review.
  • When open-source-only toolchains are required (paper uses GPT-4/3.5).
  • For fields where protocols are intentionally withheld for safety or legal reasons.

Failure Modes

  • LLM omits or mislabels units and parameters, causing unusable steps.
  • Function name mismatches (semantically same but syntactically different) penalize retrieval.
  • Evaluator bias when using an LLM (GPT-4) as judge; evaluator may prefer longer or more coherent outputs over correct ones.
  • Shuffling admissible functions can drastically reduce model accuracy.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Llama-2-7B

Metrics

  • Accuracy
  • function precision/recall
  • normalized Levenshtein distance
  • SciBERTScore
  • BLEU

Datasets

  • BIOPROT

Benchmarks

  • next-step prediction
  • protocol generation
  • function retrieval

Context Entities

Models

  • SciBERT
  • text-embedding-ada-002

Datasets

  • Protocols.io (source)