Overview
The system shows strong performance on a realistic curated benchmark (12/15 tasks) and provides reproducible environment snapshots, but assumes reasonably well-structured repos and needs human oversight for safety-critical or fragile repos.
Citations3
Evidence Strength0.80
Confidence0.80
Risk Signals11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.
Who Should Care
Summary TLDR
TOOLMAKER is an agent framework that automatically converts scientific code repositories into Dockerized, callable Python tools for LLM agents. It (1) installs and snapshots environments, (2) implements a Python wrapper function, and (3) uses a closed-loop self-correction loop to debug. The authors release TM-BENCH (15 real scientific tasks, 42 held-out invocations, 124 unit tests). On TM-BENCH TOOLMAKER implemented 12/15 tools (80%); a strong baseline (OpenHands) implemented 3/15 (20%). TOOLMAKER is practical for automating repository deployment but assumes reasonably well-structured repos and still needs human oversight for safety-critical domains.
Problem Statement
LLM agents need external tools to solve complex, multi-step scientific tasks, but humans must still install, configure, and adapt those tools. This manual setup blocks broad automation in domains with many specialized tools (e.g., life sciences). The paper asks: can an LLM agent autonomously turn a published code repo into a reusable, tested tool?
Main Contribution
TOOLMAKER: an agentic two-stage workflow that (a) auto-installs and snapshots environments (Docker) and (b) generates a Python tool function with closed-loop self-correction for debugging.
TM-BENCH: a realistic benchmark of 15 repository-based scientific tasks with 42 held-out invocations and 124 unit tests to measure correctness and robustness.
Key Findings
TOOLMAKER implemented 80% of benchmark tools (12 of 15)
OpenHands baseline implemented 20% (3 of 15)
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80% (12/15 tools passed all tests) | 20% (3/15 by OpenHands) | +60 pp | TM-BENCH (15 tasks) | Table 2, Sec. 5 Results | Table 2 |
| Average per-tool creation cost | $0.94 | $0.15 (OpenHands) | +$0.79 | Per-tool average reported across TM-BENCH | Table 3, Sec. 5 Results | Table 3 |
What To Try In 7 Days
Run TOOLMAKER on 3 of your org's public GitHub repos to evaluate auto-deployability and measure saved engineering time.
Integrate the generated Dockerized tools into one agent pipeline and run a smoke test using held-out inputs.
Use TM-BENCH or a tiny subset to validate tool correctness before production use.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Assumes referenced repositories are reasonably well-structured, documented, and installable; not guaranteed for arbitrary repos.
TM-BENCH is curated; success may drop on unvetted or broken repositories.
When Not To Use
For non-public or poorly documented repositories that cannot be installed automatically.
Directly in safety-critical clinical or wet-lab decision making without domain expert review.
Failure Modes
Environment installation fails due to missing or fragile external dependencies.
Generated tool hard-codes example invocation and fails to generalize to held-out inputs.

