Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.
Summary TLDR
TOOLMAKER is an agent framework that automatically converts scientific code repositories into Dockerized, callable Python tools for LLM agents. It (1) installs and snapshots environments, (2) implements a Python wrapper function, and (3) uses a closed-loop self-correction loop to debug. The authors release TM-BENCH (15 real scientific tasks, 42 held-out invocations, 124 unit tests). On TM-BENCH TOOLMAKER implemented 12/15 tools (80%); a strong baseline (OpenHands) implemented 3/15 (20%). TOOLMAKER is practical for automating repository deployment but assumes reasonably well-structured repos and still needs human oversight for safety-critical domains.
Problem Statement
LLM agents need external tools to solve complex, multi-step scientific tasks, but humans must still install, configure, and adapt those tools. This manual setup blocks broad automation in domains with many specialized tools (e.g., life sciences). The paper asks: can an LLM agent autonomously turn a published code repo into a reusable, tested tool?
Main Contribution
TOOLMAKER: an agentic two-stage workflow that (a) auto-installs and snapshots environments (Docker) and (b) generates a Python tool function with closed-loop self-correction for debugging.
TM-BENCH: a realistic benchmark of 15 repository-based scientific tasks with 42 held-out invocations and 124 unit tests to measure correctness and robustness.
Empirical result: TOOLMAKER correctly implemented 12/15 tasks (80%) vs OpenHands baseline 3/15 (20%), demonstrating substantially improved real-world repository handling.
Ablations showing paper summaries reduce iteration count and cheaper LLMs trade cost for lower success.
Key Findings
TOOLMAKER implemented 80% of benchmark tools (12 of 15)
OpenHands baseline implemented 20% (3 of 15)
TM-BENCH contains 15 tasks, 42 held-out invocations, 124 unit tests
Average tool creation cost ≈ $0.94 and 21.8 actions for TOOLMAKER
Including paper summaries reduced iterations and actions but did not increase success rate
Results
Accuracy
Average per-tool creation cost
Average actions per tool creation
Benchmark scale
Who Should Care
What To Try In 7 Days
Run TOOLMAKER on 3 of your org's public GitHub repos to evaluate auto-deployability and measure saved engineering time.
Integrate the generated Dockerized tools into one agent pipeline and run a smoke test using held-out inputs.
Use TM-BENCH or a tiny subset to validate tool correctness before production use.
Agent Features
Memory
- Conversation history snapshots
- Environment snapshots via Docker checkpoint
Planning
- LLM planning for implementation steps
- Plan + implement + diagnose loop
Tool Use
- OS interactions (bash commands, installs)
- Function-calling API to control agents
Frameworks
- OpenAI function-calling APIs
- Docker checkpointing
Is Agentic
true
Architectures
- LLM-driven agent (planning + function calling)
- Dockerized execution sandbox
Collaboration
- Structured tool-augmented LLM calls (agents chain LLM and environment interactions)
Optimization Features
Token Efficiency
- Use of paper summaries to reduce token use and iterations
System Optimization
- Environment snapshotting to reset state between iterations
Reproducibility
Data Urls
- TM-BENCH (linked from paper / GitHub repository)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Assumes referenced repositories are reasonably well-structured, documented, and installable; not guaranteed for arbitrary repos.
- TM-BENCH is curated; success may drop on unvetted or broken repositories.
- Unit tests do not guarantee correctness in all real-world edge cases or safety-critical settings.
- Does not replace physical experiments; limited to in-silico tasks.
When Not To Use
- For non-public or poorly documented repositories that cannot be installed automatically.
- Directly in safety-critical clinical or wet-lab decision making without domain expert review.
- Where the example invocation is unrepresentative of real inputs (may cause hidden failures).
Failure Modes
- Environment installation fails due to missing or fragile external dependencies.
- Generated tool hard-codes example invocation and fails to generalize to held-out inputs.
- LLM misses corner cases (e.g., masked tokens) in example invocation leading to failed tests.
- External repo changes (deletion, force-push) break reproducibility.
Core Entities
Models
- gpt-4o-2024-08-06
- o3-mini
- o1-mini-2024-09-12
- Claude 3.5 Sonnet
- OpenHands (baseline)
Metrics
- tool correctness (all unit tests pass)
- per-tool cost (USD)
- number of agent actions
- number of self-correcting iterations
Datasets
- TM-BENCH (15 tasks, 42 invocations, 124 unit tests)
Benchmarks
- TM-BENCH
- SWE-bench (related baseline benchmark)

