TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

February 17, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

3

Authors

Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Links

Abstract / PDF

Why It Matters For Business

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Summary TLDR

TOOLMAKER is an agent framework that automatically converts scientific code repositories into Dockerized, callable Python tools for LLM agents. It (1) installs the repository and snapshots the environment, (2) implements a Python wrapper function, and (3) debugs that function through closed-loop self-correction. The authors release TM-BENCH (15 real scientific tasks, 42 held-out invocations, 124 unit tests). On TM-BENCH, TOOLMAKER implemented 12/15 tools (80%); a strong baseline (OpenHands) implemented 3/15 (20%). TOOLMAKER is practical for automating repository deployment, but it assumes reasonably well-structured repos and still needs human oversight in safety-critical domains.

Problem Statement

LLM agents need external tools to solve complex, multi-step scientific tasks, but humans must still install, configure, and adapt those tools. This manual setup blocks broad automation in domains with many specialized tools (e.g., life sciences). The paper asks: can an LLM agent autonomously turn a published code repo into a reusable, tested tool?

Main Contribution

TOOLMAKER: an agentic two-stage workflow that (a) auto-installs and snapshots environments (Docker) and (b) generates a Python tool function with closed-loop self-correction for debugging.

TM-BENCH: a realistic benchmark of 15 repository-based scientific tasks with 42 held-out invocations and 124 unit tests to measure correctness and robustness.

Empirical result: TOOLMAKER correctly implemented 12/15 tasks (80%) vs OpenHands baseline 3/15 (20%), demonstrating substantially improved real-world repository handling.

Ablations showing that paper summaries reduce the iteration count without raising the success rate, and that cheaper LLMs lower cost at the expense of success rate.
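The closed-loop self-correction described above can be pictured as a generate, run, diagnose, retry cycle. The following is a minimal sketch under stated assumptions, not the authors' implementation: `self_correct`, `write_code`, `run_in_fresh_env`, and the toy stand-ins are all hypothetical names, with the LLM call and the sandboxed run replaced by plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    code: str   # candidate tool source produced in this iteration
    ok: bool    # whether the run succeeded
    log: str    # output or error text fed back into the next iteration


def self_correct(
    write_code: Callable[[Optional[str], Optional[str]], str],
    run_in_fresh_env: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 5,
) -> List[Attempt]:
    """Generate, run, and retry with feedback until a run succeeds or the budget ends."""
    history: List[Attempt] = []
    previous_code: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        # Hypothetical stand-in for an LLM call that sees the last attempt and its log.
        code = write_code(previous_code, feedback)
        # Hypothetical stand-in for executing the candidate in a restored environment.
        ok, log = run_in_fresh_env(code)
        history.append(Attempt(code, ok, log))
        if ok:
            break
        previous_code, feedback = code, log
    return history


def fake_writer(previous_code: Optional[str], feedback: Optional[str]) -> str:
    # Toy "LLM": the first draft has a bug, the second draft fixes it.
    if feedback is None:
        return "def tool(x): return x + undefined_name"
    return "def tool(x): return x + 1"


def fake_runner(code: str) -> Tuple[bool, str]:
    # Toy "sandbox": define the tool and call it on one example input.
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["tool"](1)
        return result == 2, f"result={result}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


attempts = self_correct(fake_writer, fake_runner)
for number, attempt in enumerate(attempts, start=1):
    print(number, attempt.ok, attempt.log)
```

In this toy run the first attempt fails with a `NameError`, the error text is passed back as feedback, and the second attempt succeeds. The iteration budget is an illustrative parameter, not a value taken from the paper.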

Key Findings

TOOLMAKER implemented 80% of benchmark tools (12 of 15)

Numbers: 12/15 tools correct

OpenHands baseline implemented 20% (3 of 15)

Numbers: 3/15 tools correct

TM-BENCH contains 15 tasks, 42 held-out invocations, 124 unit tests

Numbers: 15 tasks, 42 invocations, 124 tests

Average tool creation cost ≈ $0.94 and 21.8 actions for TOOLMAKER

Numbers: avg cost $0.94; 21.8 actions

Including paper summaries reduced iterations and actions but did not increase success rate

Numbers: iterations decreased (e.g., stamp task 9 → 5); tools correct 12 → 11

Results

Accuracy

Value: 80% (12/15 tools passed all tests)

Baseline: 20% (3/15 by OpenHands)

Average per-tool creation cost

Value: $0.94

Baseline: $0.15 (OpenHands)

Average actions per tool creation

Value: 21.8 actions

Baseline: 7.5 actions (OpenHands)

Benchmark scale

Value: 15 tasks, 42 test invocations, 124 unit tests

Who Should Care

What To Try In 7 Days

Run TOOLMAKER on 3 of your org's public GitHub repos to evaluate auto-deployability and measure saved engineering time.

Integrate the generated Dockerized tools into one agent pipeline and run a smoke test using held-out inputs.

Use TM-BENCH or a tiny subset to validate tool correctness before production use.
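A smoke test on held-out inputs, as suggested above, might look like the sketch below. It assumes the generated tool is exposed as a plain Python callable. `smoke_test`, `word_count_tool`, and `hard_coded_tool` are hypothetical names used only for illustration and are not part of TOOLMAKER or TM-BENCH.

```python
from typing import Any, Callable, Dict, List, Tuple

# One held-out invocation: keyword arguments plus a check applied to the output.
Invocation = Tuple[Dict[str, Any], Callable[[Any], bool]]


def smoke_test(tool: Callable[..., Any], invocations: List[Invocation]) -> List[str]:
    """Run the tool on every invocation and return human-readable failure messages."""
    failures: List[str] = []
    for index, (kwargs, check) in enumerate(invocations):
        try:
            output = tool(**kwargs)
        except Exception as exc:
            failures.append(f"invocation {index}: raised {type(exc).__name__}: {exc}")
            continue
        if not check(output):
            failures.append(f"invocation {index}: unexpected output {output!r}")
    return failures


def word_count_tool(text: str) -> int:
    # Toy tool that actually uses its input.
    return len(text.split())


def hard_coded_tool(text: str) -> int:
    # Toy tool that ignores its input, mimicking a wrapper fitted to one example.
    return 2


held_out: List[Invocation] = [
    ({"text": "three short words"}, lambda out: out == 3),
    ({"text": ""}, lambda out: out == 0),
]

print(smoke_test(word_count_tool, held_out))
print(smoke_test(hard_coded_tool, held_out))
```

The second tool would pass a test built from a two-word example but fails both held-out checks here, which is the reason to hold inputs out rather than reuse the example invocation.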

Agent Features

Memory

  • Conversation history snapshots
  • Environment snapshots via Docker checkpoint

Planning

  • LLM planning for implementation steps
  • Plan + implement + diagnose loop

Tool Use

  • OS interactions (bash commands, installs)
  • Function-calling API to control agents

Frameworks

  • OpenAI function-calling APIs
  • Docker checkpointing

Is Agentic

true

Architectures

  • LLM-driven agent (planning + function calling)
  • Dockerized execution sandbox

Collaboration

  • Structured tool-augmented LLM calls (agents chain LLM and environment interactions)

Optimization Features

Token Efficiency

  • Use of paper summaries to reduce token use and iterations

System Optimization

  • Environment snapshotting to reset state between iterations
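One way to picture resetting state between iterations is to save the installed environment as an image once, then recreate a container from that image before each attempt. The sketch below only builds the command lines and does not run them. It uses `docker commit` as a stand-in, which is an assumption: the summary mentions Docker checkpointing, but this page does not confirm the exact mechanism. The container name, image tag, and the `sleep infinity` keep-alive command (which the image must provide) are hypothetical.

```python
from typing import List


def snapshot_cmd(container: str, image_tag: str) -> List[str]:
    """Command that saves the container's current filesystem as an image."""
    return ["docker", "commit", container, image_tag]


def restore_cmds(container: str, image_tag: str) -> List[List[str]]:
    """Commands that discard the container and start a fresh one from the snapshot."""
    return [
        ["docker", "rm", "-f", container],
        # Assumes the image provides `sleep` so the container stays alive for later commands.
        ["docker", "run", "-d", "--name", container, image_tag, "sleep", "infinity"],
    ]


# Hypothetical names for illustration only.
print(snapshot_cmd("tm-env", "tm-env:installed"))
for cmd in restore_cmds("tm-env", "tm-env:installed"):
    print(cmd)
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed. Keeping command construction separate from execution makes the reset step easy to inspect and test without a Docker daemon.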

Reproducibility

Data Urls

  • TM-BENCH (linked from paper / GitHub repository)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Assumes referenced repositories are reasonably well-structured, documented, and installable; not guaranteed for arbitrary repos.
  • TM-BENCH is curated; success may drop on unvetted or broken repositories.
  • Unit tests do not guarantee correctness in all real-world edge cases or safety-critical settings.
  • Does not replace physical experiments; limited to in-silico tasks.

When Not To Use

  • For non-public or poorly documented repositories that cannot be installed automatically.
  • Directly in safety-critical clinical or wet-lab decision making without domain expert review.
  • Where the example invocation is unrepresentative of real inputs (may cause hidden failures).

Failure Modes

  • Environment installation fails due to missing or fragile external dependencies.
  • Generated tool hard-codes example invocation and fails to generalize to held-out inputs.
  • The LLM misses corner cases (e.g., masked tokens) in the example invocation, leading to failed tests.
  • External repo changes (deletion, force-push) break reproducibility.
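For the last failure mode, a common mitigation is to record a full commit SHA for each repository and check out exactly that commit. This is a general practice, not something this page attributes to the authors. The sketch below builds the `git` commands without running them; `pinned_checkout_cmds`, the example URL, and the example SHA are hypothetical.

```python
import re
from typing import List

# A full SHA-1 commit id: 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_checkout_cmds(repo_url: str, commit_sha: str, dest: str) -> List[List[str]]:
    """Commands that clone a repository and check out one exact commit."""
    if not _FULL_SHA.fullmatch(commit_sha):
        raise ValueError("expected a full 40-character lowercase hex commit SHA")
    return [
        ["git", "clone", repo_url, dest],
        # Detached HEAD at the pinned commit; fails if the commit is no longer available.
        ["git", "-C", dest, "checkout", "--detach", commit_sha],
    ]


# Hypothetical repository and commit for illustration only.
example_sha = "0123456789abcdef0123456789abcdef01234567"
for cmd in pinned_checkout_cmds("https://example.com/org/repo.git", example_sha, "repo"):
    print(cmd)
```

Pinning makes drift visible rather than preventing it: if the commit was removed by a force-push or the repository was deleted, the checkout fails loudly instead of silently building different code. Surviving deletion would additionally require keeping a mirror or archived copy.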

Core Entities

Models

  • gpt-4o-2024-08-06
  • o3-mini
  • o1-mini-2024-09-12
  • Claude 3.5 Sonnet
  • OpenHands (baseline)

Metrics

  • tool correctness (all unit tests pass)
  • per-tool cost (USD)
  • number of agent actions
  • number of self-correcting iterations
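The metrics above can be aggregated as in the sketch below, where a tool counts as correct only if every one of its unit tests passes. The `TaskResult` fields and the sample values are made up for illustration and are not TM-BENCH data. Averaging cost and actions over all attempted tasks, rather than only the successful ones, is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    name: str
    tests_passed: int
    tests_total: int
    cost_usd: float
    actions: int

    @property
    def correct(self) -> bool:
        # All-or-nothing: a single failing unit test marks the tool as incorrect.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate per-task results into benchmark-level metrics."""
    n = len(results)
    tools_correct = sum(r.correct for r in results)
    return {
        "tools_correct": tools_correct,
        "success_rate": tools_correct / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
        "avg_actions": sum(r.actions for r in results) / n,
    }


# Made-up results for four hypothetical tasks.
results = [
    TaskResult("task_a", 8, 8, 1.00, 20),
    TaskResult("task_b", 7, 8, 0.50, 30),  # one failing test, so not correct
    TaskResult("task_c", 9, 9, 1.50, 10),
    TaskResult("task_d", 0, 6, 1.00, 20),
]
summary = summarize(results)
print(summary)
```

Under the all-or-nothing rule, `task_b` contributes nothing to the success rate even though it passes 7 of its 8 tests. A per-test pass rate would tell a softer story, which is worth keeping in mind when comparing numbers across benchmarks.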

Datasets

  • TM-BENCH (15 tasks, 42 invocations, 124 unit tests)

Benchmarks

  • TM-BENCH
  • SWE-bench (related baseline benchmark)