TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

February 17, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The system shows strong performance on a realistic curated benchmark (12/15 tasks) and provides reproducible environment snapshots, but assumes reasonably well-structured repos and needs human oversight for safety-critical or fragile repos.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Why It Matters For Business

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Summary TLDR

TOOLMAKER is an agent framework that automatically converts scientific code repositories into Dockerized, callable Python tools for LLM agents. It (1) installs the repository's dependencies and snapshots the environment, (2) implements a Python wrapper function, and (3) debugs that function in a closed self-correction loop. The authors release TM-BENCH (15 real scientific tasks, 42 held-out invocations, 124 unit tests). On TM-BENCH, TOOLMAKER implemented 12/15 tools (80%); a strong baseline (OpenHands) implemented 3/15 (20%). TOOLMAKER is practical for automating repository deployment but assumes reasonably well-structured repos and still needs human oversight for safety-critical domains.

Problem Statement

LLM agents need external tools to solve complex, multi-step scientific tasks, but humans must still install, configure, and adapt those tools. This manual setup blocks broad automation in domains with many specialized tools (e.g., life sciences). The paper asks: can an LLM agent autonomously turn a published code repo into a reusable, tested tool?

Main Contribution

TOOLMAKER: an agentic two-stage workflow that (a) auto-installs and snapshots environments (Docker) and (b) generates a Python tool function with closed-loop self-correction for debugging.

TM-BENCH: a realistic benchmark of 15 repository-based scientific tasks with 42 held-out invocations and 124 unit tests to measure correctness and robustness.
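The two-stage workflow can be pictured as a small control loop. The sketch below is illustrative only and is not the authors' implementation: `build_tool`, `Attempt`, and the four callables (`install_env`, `write_code`, `run_code`, `diagnose`) are hypothetical names, and the LLM calls are abstracted into plain Python callables so the control flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One failed pass through the implement-run-diagnose cycle."""
    code: str
    diagnosis: str


def build_tool(
    install_env: Callable[[], str],                      # stage 1: returns a snapshot id
    write_code: Callable[[List[Attempt]], str],          # drafts code given past failures
    run_code: Callable[[str, str], Tuple[bool, str]],    # (snapshot, code) -> (ok, output)
    diagnose: Callable[[str, str], str],                 # (code, output) -> diagnosis text
    max_iterations: int = 3,
) -> Optional[str]:
    """Return working tool code, or None if every iteration failed."""
    # Stage 1: install the repository once and keep a reusable snapshot.
    snapshot = install_env()

    # Stage 2: implement, run from the clean snapshot, diagnose, retry.
    history: List[Attempt] = []
    for _ in range(max_iterations):
        code = write_code(history)
        ok, output = run_code(snapshot, code)
        if ok:
            return code
        history.append(Attempt(code=code, diagnosis=diagnose(code, output)))
    return None
```

Each retry starts from the same snapshot id, so side effects of a failed attempt do not leak into the next one. The iteration cap of 3 is an arbitrary placeholder, not a value taken from the paper.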

Key Findings

TOOLMAKER implemented 80% of benchmark tools (12 of 15)

Numbers: 12/15 tools correct

Practical Use: Many complex research tools from curated repos can be deployed automatically, without manual installation.

Evidence Ref: Table 2, Sec. 5 Results

OpenHands baseline implemented 20% (3 of 15)

Numbers: 3/15 tools correct

Practical Use: Existing software-engineering agents fail more often at environment setup; invest in environment capture and testing.

Evidence Ref: Table 2, Sec. 5 Results

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 80% (12/15 tools passed all tests) | 20% (3/15 by OpenHands) | +60 pp | TM-BENCH (15 tasks) | Table 2, Sec. 5 Results | Table 2
Average per-tool creation cost | $0.94 | $0.15 (OpenHands) | +$0.79 | Per-tool average reported across TM-BENCH | Table 3, Sec. 5 Results | Table 3
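The deltas in the table follow directly from the reported counts and costs. The short calculation below reproduces them; the helper name `success_rate` is introduced here for illustration only.

```python
def success_rate(passed: int, total: int) -> float:
    """Share of tools that passed all tests, in percent."""
    return 100.0 * passed / total


toolmaker_pct = success_rate(12, 15)        # 80.0
openhands_pct = success_rate(3, 15)         # 20.0
delta_pp = toolmaker_pct - openhands_pct    # 60.0 percentage points

# Per-tool cost figures as reported in the table (USD).
toolmaker_cost = 0.94
openhands_cost = 0.15
cost_delta = round(toolmaker_cost - openhands_cost, 2)   # 0.79
cost_ratio = round(toolmaker_cost / openhands_cost, 1)   # 6.3

print(f"accuracy delta: +{delta_pp:.0f} pp")
print(f"cost delta: +${cost_delta:.2f} ({cost_ratio}x the baseline)")
```

Read together, the higher success rate comes at roughly six times the baseline's per-tool cost, although both remain under one dollar per tool.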

What To Try In 7 Days

Run TOOLMAKER on 3 of your org's public GitHub repos to evaluate auto-deployability and measure saved engineering time.

Integrate the generated Dockerized tools into one agent pipeline and run a smoke test using held-out inputs.

Use TM-BENCH or a tiny subset to validate tool correctness before production use.
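A smoke test of this kind can be as small as the sketch below. It assumes a generated tool is an ordinary Python callable and that each held-out invocation comes with its own list of checks; `tool_is_correct` and the case layout are hypothetical and are not the TM-BENCH harness. The all-or-nothing rule mirrors the "all unit tests pass" correctness metric listed under Core Entities.

```python
from typing import Any, Callable, Dict, List, Tuple

Invocation = Dict[str, Any]        # keyword arguments for one tool call
Check = Callable[[Any], bool]      # one unit test on the tool's return value
Case = Tuple[Invocation, List[Check]]


def tool_is_correct(tool: Callable[..., Any], cases: List[Case]) -> bool:
    """A tool counts as correct only if every check passes on every invocation."""
    for kwargs, checks in cases:
        try:
            result = tool(**kwargs)
        except Exception:
            # A crash on any held-out input fails the whole tool.
            return False
        if not all(check(result) for check in checks):
            return False
    return True
```

Keeping the held-out inputs different from the example used during tool creation is what makes the check meaningful: a tool that merely replays the example output fails on the first unseen input.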

Agent Features

Memory
Conversation history snapshots; Environment snapshots via Docker checkpoint
Planning
LLM planning for implementation steps; Plan + implement + diagnose loop
Tool Use
OS interactions (bash commands, installs); Function-calling API to control agents
Frameworks
OpenAI function-calling APIs; Docker checkpointing
Is Agentic

Yes

Architectures
LLM-driven agent (planning + function calling); Dockerized execution sandbox
Collaboration
Structured tool-augmented LLM calls (agents chain LLM and environment interactions)
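To make the function-calling side concrete, the sketch below shows one way a generated tool could be described to an LLM using the `tools` entry layout of OpenAI-style chat APIs. This summary does not show the schemas TOOLMAKER actually uses, so the helper `as_function_tool`, the tool name, and the parameter are all hypothetical.

```python
import json
from typing import Any, Dict


def as_function_tool(
    name: str,
    description: str,
    parameters: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Wrap a tool signature in a function-calling tool description."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # JSON Schema describing the keyword arguments of the tool.
            "parameters": {
                "type": "object",
                "properties": parameters,
                "required": sorted(parameters),
                "additionalProperties": False,
            },
        },
    }


# Hypothetical example tool; not one of the TM-BENCH tasks.
spec = as_function_tool(
    name="segment_image",
    description="Run a segmentation model on one image and return the mask path.",
    parameters={"image_path": {"type": "string", "description": "Input image file."}},
)
print(json.dumps(spec, indent=2))
```

Marking every parameter as required is a simplification; a real tool with optional arguments would list only the mandatory ones.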

Optimization Features

Token Efficiency
Use of paper summaries to reduce token use and iterations
System Optimization
Environment snapshotting to reset state between iterations
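One common way to approximate this kind of snapshot-and-reset is `docker commit` followed by starting a fresh container from the committed image. The summary does not say which Docker mechanism TOOLMAKER uses, so the sketch below is an assumption: it only builds the command lines, and the container and image names are placeholders.

```python
from typing import List


def snapshot_cmd(container: str, image_tag: str) -> List[str]:
    """Command that saves the container's filesystem as a reusable image."""
    return ["docker", "commit", container, image_tag]


def restore_cmds(container: str, image_tag: str) -> List[List[str]]:
    """Commands that discard the dirty container and restart from the snapshot."""
    return [
        ["docker", "rm", "-f", container],
        # `sleep infinity` keeps the container alive for later `docker exec` calls;
        # it assumes the image ships a `sleep` that accepts `infinity`.
        ["docker", "run", "-d", "--name", container, image_tag, "sleep", "infinity"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a host with Docker installed. Note that `docker commit` captures the container filesystem only; data in mounted volumes and the memory of running processes are not part of the image.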

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

TM-BENCH (linked from paper / GitHub repository)

Risks & Boundaries

Limitations

Assumes referenced repositories are reasonably well-structured, documented, and installable; not guaranteed for arbitrary repos.

TM-BENCH is curated; success may drop on unvetted or broken repositories.

When Not To Use

For non-public or poorly documented repositories that cannot be installed automatically.

Directly in safety-critical clinical or wet-lab decision making without domain expert review.

Failure Modes

Environment installation fails due to missing or fragile external dependencies.

Generated tool hard-codes example invocation and fails to generalize to held-out inputs.
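The hard-coding failure can sometimes be caught before any held-out run with a cheap static check. The heuristic below is not from the paper: it flags string arguments of the example invocation that reappear as literals in the generated source. The function name and the sample sources are hypothetical.

```python
import ast
from typing import Any, Dict, List


def hardcoded_example_values(source: str, example_args: Dict[str, Any]) -> List[str]:
    """Names of example arguments whose string value appears literally in source."""
    literals = {
        node.value
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    }
    return sorted(
        name
        for name, value in example_args.items()
        if isinstance(value, str) and value in literals
    )


suspicious = 'def tool(image_path):\n    return run("/data/example.png")\n'
clean = 'def tool(image_path):\n    return run(image_path)\n'
example = {"image_path": "/data/example.png"}

print(hardcoded_example_values(suspicious, example))
print(hardcoded_example_values(clean, example))
```

The check only covers exact string matches, so it will miss values that are rebuilt from pieces and may flag legitimate defaults. It complements held-out invocations rather than replacing them.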

Core Entities

Models

gpt-4o-2024-08-06, o3-mini, o1-mini-2024-09-12, Claude 3.5 Sonnet, OpenHands (baseline)

Metrics

tool correctness (all unit tests pass); per-tool cost (USD); number of agent actions; number of self-correcting iterations

Datasets

TM-BENCH (15 tasks, 42 invocations, 124 unit tests)

Benchmarks

TM-BENCH, SWE-bench (related baseline benchmark)