Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
BIOPROT and the pseudocode evaluation let teams measure and improve LLM planning for lab protocols quickly, reducing expert labeling and enabling reproducible protocol generation for automation workflows.
Summary TLDR
The authors build BIOPROT: 100 biology lab protocols translated into model-readable pseudocode. They propose evaluating LLMs by supplying a closed set of admissible pseudofunctions and scoring the generated pseudocode on function choice, ordering, and argument accuracy. GPT-4 outperforms GPT-3.5 on ordering and step prediction, and most GPT-4-generated pseudocode needed no manual edits. They demonstrate real-world utility by generating two new protocols and successfully running one in a lab. The dataset, code, and prompts are public.
Problem Statement
Natural language is brittle for evaluating multi-step lab protocols: small differences matter, descriptions vary in detail, and manual expert review is slow. The paper proposes turning protocols into pseudocode and a closed action set so models can be evaluated automatically and robustly.
Main Contribution
A protocol-evaluation method that converts natural-language protocols into protocol-specific pseudofunctions and pseudocode.
BIOPROT: a manually reviewed dataset of 100 biology protocols with pseudocode and machine-generated summaries.
A suite of tasks and metrics: next-step prediction, full protocol generation, function retrieval, and argument scoring.
Empirical evaluation of GPT-3.5, GPT-4, and Llama-2 on these tasks, including ablations (shuffled input order, feedback).
A proof-of-concept agent that assembles protocols from retrieved pseudofunctions; it generated two new protocols, one of which was executed in a real lab.
Public release of dataset and code for reproducible benchmarking.
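To make the pseudofunction idea concrete, here is a minimal sketch of checking generated pseudocode against a closed action set. It assumes the pseudocode is written as Python-style calls with literal keyword arguments. The function names, the `ADMISSIBLE` set, and the example protocol are hypothetical and are not taken from BIOPROT.

```python
import ast

# Hypothetical admissible pseudofunctions for one protocol (illustrative names only).
ADMISSIBLE = {"add_reagent", "incubate", "centrifuge"}

def extract_calls(pseudocode):
    """Return (function_name, keyword_arguments) for each top-level call, in order.

    Assumes every keyword argument is a literal (string, number, ...);
    ast.literal_eval raises ValueError for anything else.
    """
    calls = []
    for stmt in ast.parse(pseudocode).body:
        node = stmt.value if isinstance(stmt, (ast.Expr, ast.Assign)) else None
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            kwargs = {
                kw.arg: ast.literal_eval(kw.value)
                for kw in node.keywords
                if kw.arg is not None  # skip **kwargs expansions
            }
            calls.append((node.func.id, kwargs))
    return calls

def inadmissible(calls, admissible=ADMISSIBLE):
    """Names of called functions that fall outside the closed action set."""
    return [name for name, _ in calls if name not in admissible]

generated = '''
add_reagent(name="lysis buffer", volume="200 uL")
incubate(temperature="37 C", duration="30 min")
vortex(duration="10 s")
'''

calls = extract_calls(generated)
print([name for name, _ in calls])
print(inadmissible(calls))
```

Parsing into (name, arguments) pairs is what makes the function-choice, ordering, and argument checks separable: each can be scored on its own slice of the parsed calls.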
Key Findings
BIOPROT contains 100 biology protocols translated into pseudocode.
GPT-4-generated pseudocode required no manual edits for the majority of protocols.
GPT-4 is better at ordering steps than GPT-3.5 on full protocol generation.
Step-level function prediction is good but sensitive to input order.
Function retrieval from other protocols remains weak and ambiguous.
Real-world validation: a GPT-4–assembled protocol ran successfully in a lab.
Results
Accuracy
Protocol generation: normalized Levenshtein distance (lower is better)
Function retrieval precision / recall (GPT-4, nearest neighbors)
Generated pseudocode quality (manual verification)
Human benchmark on function selection
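The ordering and function-choice metrics named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the edit distance is computed over sequences of function names and normalized by the longer sequence length, and precision/recall are computed over sets of function names. The paper's exact normalization may differ.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def normalized_levenshtein(pred, ref):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0

def precision_recall(pred, ref):
    """Set-based precision and recall over function names."""
    pred_set, ref_set = set(pred), set(ref)
    hits = len(pred_set & ref_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(ref_set) if ref_set else 0.0
    return precision, recall

# Hypothetical reference and prediction: same functions, two steps swapped.
ref = ["add_reagent", "incubate", "centrifuge", "discard_supernatant"]
pred = ["add_reagent", "centrifuge", "incubate", "discard_supernatant"]

print(normalized_levenshtein(pred, ref))  # 0.5
print(precision_recall(pred, ref))        # (1.0, 1.0)
```

The example shows why both metrics are reported: a prediction can select exactly the right functions while still being penalized for ordering them incorrectly.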
Who Should Care
What To Try In 7 Days
Run the BIOPROT pipeline on your model to get function-level and ordering metrics.
Use the pseudofunction approach to convert 10 of your internal SOPs and spot-check LLM outputs.
Prototype a retrieval+assembly agent to auto-assemble protocol steps and have a scientist verify one simple experiment.
Agent Features
Memory
- Retrieval memory (embedding index)
Planning
- Planning with LLMs
- Function-level planning (admissible action set)
Tool Use
- Protocol search/retrieval tool
- Embedding-based nearest-neighbour search
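As a rough illustration of the embedding-based nearest-neighbour search used for retrieval, the sketch below ranks protocols by cosine similarity. The three-dimensional vectors and protocol names are toy values invented for this example; a real pipeline would obtain vectors from an embedding model and would likely use a vector index rather than a linear scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, index, k=2):
    """Return the k keys of `index` whose vectors are most similar to `query`."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

# Toy embedding index: protocol title -> made-up embedding vector.
index = {
    "pcr_amplification": [0.9, 0.1, 0.0],
    "gel_electrophoresis": [0.7, 0.6, 0.1],
    "cell_culture_passage": [0.0, 0.2, 0.9],
}

query = [1.0, 0.2, 0.0]  # made-up embedding of the new protocol's description
top = nearest(query, index, k=2)
print(top)  # ['pcr_amplification', 'gel_electrophoresis']
```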
Frameworks
- LangChain
- Toolformer-like agent
Is Agentic
true
Architectures
- LLM with tool access
- Toolformer-like chain-of-thought agent
Collaboration
- Human-in-the-loop verification
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on closed-source GPT APIs (authors spent ~$1000 on calls).
- Dataset focused on biology; other domains may need custom functions.
- Function retrieval is ambiguous without canonical naming and hurts assembly.
- Automatic evaluation depends on the quality of the teacher model's generated pseudocode and on manual corrections.
When Not To Use
- For fully automated lab execution without human scientist review.
- When open-source-only toolchains are required (paper uses GPT-4/3.5).
- For fields with protocols intentionally withheld for safety or legality.
Failure Modes
- LLM omits or mislabels units and parameters, causing unusable steps.
- Function name mismatches (semantically same but syntactically different) penalize retrieval.
- Evaluator bias when using an LLM (GPT-4) as judge; evaluator may prefer longer or more coherent outputs over correct ones.
- Shuffling admissible functions can drastically reduce model accuracy.
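One way to probe the order sensitivity noted above is to build the same prompt with the admissible functions in their original order and in a shuffled order, then compare model scores on each. The sketch below only constructs the two prompts; the prompt wording, function names, and `build_prompt` helper are hypothetical and do not reproduce the paper's prompts.

```python
import random

def build_prompt(title, functions, shuffle=False, seed=0):
    """List the admissible functions, optionally in a seeded shuffled order."""
    listed = list(functions)  # copy so the caller's list is never mutated
    if shuffle:
        random.Random(seed).shuffle(listed)
    lines = [f"Protocol: {title}", "Admissible functions:"]
    lines += [f"- {name}" for name in listed]
    lines.append("Write the protocol as pseudocode using only these functions.")
    return "\n".join(lines)

# Hypothetical function list, given in the order the steps occur.
functions = ["add_reagent", "incubate", "centrifuge", "discard_supernatant"]

ordered_prompt = build_prompt("DNA extraction", functions)
shuffled_prompt = build_prompt("DNA extraction", functions, shuffle=True, seed=1)
```

Fixing the seed keeps the ablation reproducible, so any score difference between the two conditions can be attributed to ordering rather than to a different shuffle on each run.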
Core Entities
Models
- GPT-3.5
- GPT-4
- Llama-2-7B
Metrics
- Accuracy
- function precision/recall
- normalized Levenshtein distance
- SciBERTScore
- BLEU
Datasets
- BIOPROT
Benchmarks
- next-step prediction
- protocol generation
- function retrieval
Context Entities
Models
- SciBERT
- text-embedding-ada-002
Datasets
- Protocols.io (source)

