TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

Overview

Decision Snapshot: Needs Validation

The system shows strong performance on a realistic curated benchmark (12/15 tasks) and provides reproducible environment snapshots, but assumes reasonably well-structured repos and needs human oversight for safety-critical or fragile repos.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Founder, Data Scientist

Summary TLDR

TOOLMAKER is an agent framework that automatically converts scientific code repositories into Dockerized, callable Python tools for LLM agents. It (1) installs the repository's dependencies and snapshots the resulting environment, (2) implements a Python wrapper function, and (3) debugs that function through closed-loop self-correction. The authors release TM-BENCH (15 real scientific tasks, 42 held-out invocations, 124 unit tests). On TM-BENCH, TOOLMAKER implemented 12/15 tools (80%), while a strong baseline (OpenHands) implemented 3/15 (20%). TOOLMAKER is practical for automating repository deployment, but it assumes reasonably well-structured repos and still needs human oversight in safety-critical domains.

Problem Statement

LLM agents need external tools to solve complex, multi-step scientific tasks, but humans must still install, configure, and adapt those tools. This manual setup blocks broad automation in domains with many specialized tools (e.g., life sciences). The paper asks: can an LLM agent autonomously turn a published code repo into a reusable, tested tool?

Main Contribution

TOOLMAKER: an agentic two-stage workflow that (a) auto-installs and snapshots environments (Docker) and (b) generates a Python tool function with closed-loop self-correction for debugging.

TM-BENCH: a realistic benchmark of 15 repository-based scientific tasks with 42 held-out invocations and 124 unit tests to measure correctness and robustness.
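The closed-loop self-correction idea in the first contribution can be sketched as a retry loop that feeds each failure back into the next attempt. This is a minimal illustration, not the authors' implementation: `self_correct`, `Attempt`, `fake_implement`, and `fake_run` are hypothetical names, the LLM call and the sandboxed execution are replaced by toy stand-ins, and the iteration budget of 5 is an assumed value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    code: str
    error: str  # error message produced when the candidate was executed


def self_correct(
    implement: Callable[[List[Attempt]], str],
    run: Callable[[str], Optional[str]],
    max_iterations: int = 5,  # assumed budget, not taken from the paper
) -> Optional[str]:
    """Return the first candidate that runs cleanly, or None if the budget runs out."""
    history: List[Attempt] = []
    for _ in range(max_iterations):
        code = implement(history)  # would be an LLM call in a real system
        error = run(code)          # would execute inside a restored environment
        if error is None:
            return code
        history.append(Attempt(code, error))  # feed the failure back in
    return None


# Toy stand-ins: the third candidate is the first one that "works".
def fake_implement(history: List[Attempt]) -> str:
    return "v%d" % (len(history) + 1)


def fake_run(code: str) -> Optional[str]:
    return None if code == "v3" else "ImportError in " + code


result = self_correct(fake_implement, fake_run)
print(result)  # v3
```

Passing the full failure history to `implement` is one design choice among several; a real agent might summarize or truncate it to save tokens.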

Key Findings

TOOLMAKER implemented 80% of benchmark tools (12 of 15)

Numbers: 12/15 tools correct

Practical Use: You can auto-deploy many complex research tools without manual install for most curated repos.

Evidence Ref: Table 2, Sec. 5 Results

OpenHands baseline implemented 20% (3 of 15)

Numbers: 3/15 tools correct

Practical Use: Existing software-engineering agents fail more often at environment setup; invest in environment capture and testing.

Evidence Ref: Table 2, Sec. 5 Results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	80% (12/15 tools passed all tests)	20% (3/15 by OpenHands)	+60 pp	TM-BENCH (15 tasks)	Table 2, Sec. 5 Results	Table 2
Average per-tool creation cost	$0.94	$0.15 (OpenHands)	+$0.79	Per-tool average reported across TM-BENCH	Table 3, Sec. 5 Results	Table 3
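The deltas in the table follow directly from the reported counts and costs. The short check below reproduces them; the cost ratio is a derived figure that does not appear in the table, and the helper name `pass_rate` is only illustrative.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of tools that passed all of their tests."""
    return 100.0 * passed / total


toolmaker = pass_rate(12, 15)        # 80.0
openhands = pass_rate(3, 15)         # 20.0
delta_pp = toolmaker - openhands     # difference in percentage points

cost_delta = round(0.94 - 0.15, 2)   # USD per tool, TOOLMAKER minus OpenHands
cost_ratio = round(0.94 / 0.15, 1)   # derived here, not reported in the table

print(delta_pp, cost_delta, cost_ratio)  # 60.0 0.79 6.3
```

Read together, the two rows say that TOOLMAKER costs more per tool but succeeds on four times as many tasks.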

What To Try In 7 Days

Run TOOLMAKER on 3 of your org's public GitHub repos to evaluate auto-deployability and measure saved engineering time.

Integrate the generated Dockerized tools into one agent pipeline and run a smoke test using held-out inputs.

Use TM-BENCH or a tiny subset to validate tool correctness before production use.
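A smoke test over held-out inputs, as suggested in the steps above, can be as small as the harness below. It is a sketch under assumptions: `smoke_test` and `count_lines` are hypothetical, and `count_lines` stands in for whatever callable a generated tool exposes.

```python
from typing import Any, Callable, Dict, List, Tuple

Case = Tuple[Dict[str, Any], Callable[[Any], bool]]


def smoke_test(tool: Callable[..., Any], cases: List[Case]) -> List[str]:
    """Call the tool on each held-out input and collect human-readable failures."""
    failures: List[str] = []
    for kwargs, check in cases:
        try:
            output = tool(**kwargs)
        except Exception as exc:  # a crash counts as a failure, not a harness error
            failures.append(f"{kwargs}: raised {type(exc).__name__}: {exc}")
            continue
        if not check(output):
            failures.append(f"{kwargs}: unexpected output {output!r}")
    return failures


# Hypothetical stand-in for a generated tool.
def count_lines(text: str) -> int:
    return len(text.splitlines())


held_out: List[Case] = [
    ({"text": "a\nb\nc"}, lambda out: out == 3),
    ({"text": ""}, lambda out: out == 0),
]

failures = smoke_test(count_lines, held_out)
print(failures)  # []
```

Keeping the checks as predicates rather than exact expected values makes the harness usable for tools whose outputs are only partly deterministic.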

Agent Features

Memory

Conversation history snapshots

Environment snapshots via Docker checkpoint

Planning

LLM planning for implementation steps

Plan + implement + diagnose loop

Tool Use

OS interactions (bash commands, installs)

Function-calling API to control agents

Frameworks

OpenAI function-calling APIs

Docker checkpointing
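To make a Python tool callable through a function-calling API, it has to be described by a JSON schema. The sketch below derives such a description from a function signature, using the `{"type": "function", "function": {...}}` shape that the OpenAI Chat Completions `tools` parameter accepts. How TOOLMAKER itself builds its schemas is not stated here; `to_tool_schema` and `segment_image` are hypothetical, and only a few primitive types are mapped.

```python
import inspect
from typing import Any, Callable, Dict

# Minimal mapping from Python annotations to JSON Schema types (assumption:
# anything not listed is treated as a string).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_tool_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Describe a Python function in function-calling tool format."""
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the caller must supply it
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def segment_image(image_path: str, threshold: float = 0.5) -> str:
    """Hypothetical tool: segment an image and return the mask path."""
    return image_path + ".mask.png"


schema = to_tool_schema(segment_image)
print(schema["function"]["parameters"]["required"])  # ['image_path']
```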

Is Agentic

Yes

Architectures

LLM-driven agent (planning + function calling)

Dockerized execution sandbox

Collaboration

Structured tool-augmented LLM calls (agents chain LLM and environment interactions)

Optimization Features

Token Efficiency

Use of paper summaries to reduce token use and iterations

System Optimization

Environment snapshotting to reset state between iterations
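One common way to approximate this kind of snapshot-and-reset cycle is `docker commit` followed by starting a fresh container from the committed image. Whether TOOLMAKER uses exactly these commands is not stated in this summary, so the sketch below is an assumption. It only builds the command lines; the container and image names are hypothetical, and `sleep infinity` assumes the image ships GNU coreutils.

```python
from typing import List


def snapshot_cmd(container: str, image_tag: str) -> List[str]:
    """Command that saves the container's current filesystem as an image."""
    return ["docker", "commit", container, image_tag]


def restore_cmd(image_tag: str, name: str) -> List[str]:
    """Command that starts a fresh, detached container from a saved image."""
    return ["docker", "run", "-d", "--name", name, image_tag, "sleep", "infinity"]


def discard_cmd(name: str) -> List[str]:
    """Command that force-removes a container whose state is no longer wanted."""
    return ["docker", "rm", "-f", name]


# One reset cycle: snapshot after install, then discard and restore per iteration.
cycle = [
    snapshot_cmd("install-env", "tool-env:installed"),
    discard_cmd("attempt-1"),
    restore_cmd("tool-env:installed", "attempt-2"),
]
for cmd in cycle:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed.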

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/KatherLab/ToolMaker

Data URLs

TM-BENCH (linked from paper / GitHub repository)

Risks & Boundaries

Limitations

Assumes referenced repositories are reasonably well-structured, documented, and installable; not guaranteed for arbitrary repos.

TM-BENCH is curated; success may drop on unvetted or broken repositories.

When Not To Use

For non-public or poorly documented repositories that cannot be installed automatically.

Directly in safety-critical clinical or wet-lab decision making without domain expert review.

Failure Modes

Environment installation fails due to missing or fragile external dependencies.

Generated tool hard-codes example invocation and fails to generalize to held-out inputs.
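The second failure mode can be screened for with a cheap heuristic: call the tool on several distinct inputs and flag it if every output is identical. This is only a sketch. `looks_hard_coded` and both toy tools are hypothetical, and a flagged tool still needs manual review because some tools legitimately return the same value for different inputs.

```python
from typing import Any, Callable, Dict, List


def looks_hard_coded(tool: Callable[..., Any], invocations: List[Dict[str, Any]]) -> bool:
    """Flag a tool whose output does not change across distinct inputs."""
    outputs = {repr(tool(**kwargs)) for kwargs in invocations}
    return len(invocations) > 1 and len(outputs) == 1


def hard_coded_tool(path: str) -> int:
    return 42  # ignores its argument, like a tool that baked in the example call


def honest_tool(path: str) -> int:
    return len(path)  # output depends on the input


inputs = [{"path": "a.txt"}, {"path": "longer_name.txt"}]
print(looks_hard_coded(hard_coded_tool, inputs))  # True
print(looks_hard_coded(honest_tool, inputs))      # False
```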

Core Entities

Models

gpt-4o-2024-08-06

o3-mini

o1-mini-2024-09-12

Claude 3.5 Sonnet

OpenHands (baseline)

Metrics

tool correctness (all unit tests pass)

per-tool cost (USD)

number of agent actions

number of self-correcting iterations
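The tool-correctness metric is all-or-nothing: a tool counts as correct only when every one of its unit tests passes. A minimal sketch of that scoring rule follows. The function names and the toy results are hypothetical, and treating a tool with no tests as incorrect is an assumption of this example.

```python
from typing import Dict, List


def tool_is_correct(test_results: List[bool]) -> bool:
    """A tool is correct only if it has tests and all of them pass."""
    return bool(test_results) and all(test_results)


def benchmark_score(per_tool: Dict[str, List[bool]]) -> float:
    """Fraction of tools that are correct under the all-tests-pass rule."""
    correct = sum(tool_is_correct(results) for results in per_tool.values())
    return correct / len(per_tool)


# Toy data: tool_b fails one test, so it scores zero despite passing the rest.
toy_results = {
    "tool_a": [True, True, True],
    "tool_b": [True, False, True],
    "tool_c": [True],
}
score = benchmark_score(toy_results)
print(round(score, 2))  # 0.67
```

Under this rule a single failing test costs the whole tool, which is why partial progress does not show up in the headline 12/15 figure.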

Datasets

TM-BENCH (15 tasks, 42 invocations, 124 unit tests)

Benchmarks

TM-BENCH

SWE-bench (related baseline benchmark)

