OctoTools: a training-free planner+executor agent that plugs in tools to boost multi-step reasoning

February 16, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

2

Authors

Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou

Why It Matters For Business

OctoTools turns general LLMs into practical, multi-step assistants by plugging in specialized tools and an explicit planner; this improves correctness on domain tasks and lets teams add domain tools without retraining models.

Summary TLDR

OctoTools is an open-source, training-free agent framework that wraps heterogeneous tools into standardized "tool cards" and runs a planner → command generator → executor loop. On 16 diverse reasoning benchmarks it raises average accuracy from 49.2% (GPT-4o zero-shot) to 58.5%, a gain of 9.3 percentage points over zero-shot and 7.7 points over chain-of-thought prompting. A lightweight greedy toolset optimizer and an explicit context trajectory (the step history) help the system pick useful tools and verify steps. The code and interactive demos are published on the project site.

Problem Statement

Current LLMs struggle on multi-step, cross-domain reasoning because a single model response has no access to specialized perception, calculation, or retrieval. Existing agent and tool frameworks either require training, are tied to one domain, or offer only limited planning. OctoTools aims to provide a training-free, extensible agent that orchestrates many tool types under explicit multi-step plans to improve complex reasoning.

Main Contribution

A training-free planner-executor agent design that separates high-level planning from command generation and execution.

Standardized tool cards that wrap diverse tools (vision, search, code, domain classifiers) with input/output metadata and limitations.

A greedy task-specific toolset optimization algorithm to pick beneficial tools using a small validation set.

Large-scale evaluation on 16 benchmarks showing consistent accuracy gains and ablation studies that quantify the role of steps and tool selection.
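
To make the tool-card idea concrete, here is a minimal sketch of what such a wrapper could look like. The `ToolCard` class, its field names, and the `to_prompt` method are assumptions made for illustration; they are not taken from the OctoTools codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCard:
    """Hypothetical tool card: metadata plus the callable it wraps."""
    name: str
    description: str
    input_types: Dict[str, str]      # argument name -> human-readable type
    output_type: str
    run: Callable[..., object]       # the wrapped tool itself
    limitations: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the metadata as text that a planner could read."""
        args = ", ".join(f"{k}: {v}" for k, v in self.input_types.items())
        limits = "; ".join(self.limitations) or "none listed"
        return (f"{self.name}({args}) -> {self.output_type}\n"
                f"  {self.description}\n"
                f"  Limitations: {limits}")


# A toy card wrapping a trivial tool (illustrative only).
calculator = ToolCard(
    name="calculator",
    description="Adds two numbers.",
    input_types={"a": "float", "b": "float"},
    output_type="float",
    run=lambda a, b: a + b,
    limitations=["addition only"],
)
```

Keeping limitations next to the input and output metadata means the planner sees what a tool cannot do at the same moment it decides whether to call it.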

Key Findings

OctoTools raises average accuracy from 49.2% to 58.5% across 16 benchmarks.

Numbers: Avg accuracy OctoTools 58.5% vs zero-shot 49.2% (∆ +9.3%)

Optimizing the toolset with a small validation set gives a further boost over the base toolset.

Numbers: Optimized toolset 58.9% vs OctoTools base 53.9% (∆ +5.0%)

OctoTools outperforms other general agent frameworks when all use the same tools and model.

Numbers: Avg 58.5% vs AutoGen 47.9% (∆ +10.6%); vs LangChain 51.2% (∆ +7.3%)
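
The ∆ values in these findings are absolute differences in percentage points, not relative improvements. The short check below recomputes them from the reported averages; the function names are illustrative.

```python
# Average accuracies as reported in the findings above (percent).
reported = {
    "OctoTools": 58.5,
    "GPT-4o zero-shot": 49.2,
    "AutoGen": 47.9,
    "LangChain": 51.2,
}


def gain_points(system: str, baseline: str) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(reported[system] - reported[baseline], 1)


def relative_gain(system: str, baseline: str) -> float:
    """Relative gain over the baseline, in percent, rounded to one decimal."""
    diff = reported[system] - reported[baseline]
    return round(100.0 * diff / reported[baseline], 1)
```

For example, `gain_points("OctoTools", "GPT-4o zero-shot")` returns 9.3, while `relative_gain` for the same pair returns 18.9, so the two readings of "+9.3%" differ by roughly a factor of two.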

Results

Accuracy

Value: 58.5%

Baseline: GPT-4o zero-shot 49.2%

Gain vs Chain-of-Thought (CoT)

Value: ∆ +7.7% (avg)

Baseline: CoT 50.8%

Toolset optimization effect

Value: 58.9% (optimized) vs 53.9% (base)

Baseline: OctoTools base 53.9%

Comparison vs other agent frameworks

Value: 58.5% (OctoTools) vs 47.9% (AutoGen), 51.0% (GPT-Functions), 51.2% (LangChain)

Baseline: AutoGen 47.9%

Operational cost example

Value: Typically <$5 for 100 queries with max 10 steps (using GPT-4o)

Who Should Care

What To Try In 7 Days

Wrap one or two domain tools (search, calculator, a vision patch-zoomer) as simple tool cards and run a planner-executor loop on 100 validation examples to estimate gains.

Implement the greedy toolset optimizer: try adding each tool to the base set and measure validation delta to remove noisy tools.

Instrument your agent to store full trajectories (actions, commands, outputs) to enable auditing and quick debugging of tool failures.
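
The greedy optimizer step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate` stands in for running your agent on the validation set with a given toolset, and the toy accuracy numbers are made up.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_toolset(
    base: Set[str],
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Keep each candidate whose addition to the base set raises validation accuracy."""
    base_acc = evaluate(frozenset(base))
    selected = set(base)
    for tool in candidates:
        if tool in base:
            continue
        # Each candidate is tested alone against the base set, not against
        # the growing selection, so tool interactions are not measured.
        delta = evaluate(frozenset(base | {tool})) - base_acc
        if delta > 0:
            selected.add(tool)
    return selected


# Toy evaluator with made-up effects, standing in for a real validation run.
TOY_EFFECT = {"search": 0.04, "calculator": 0.02, "noisy_ocr": -0.03}


def toy_evaluate(tools: FrozenSet[str]) -> float:
    return 0.50 + sum(TOY_EFFECT.get(t, 0.0) for t in tools)


chosen = greedy_toolset({"generalist"}, TOY_EFFECT, toy_evaluate)
```

With these toy numbers `chosen` contains `generalist`, `search`, and `calculator`, and drops `noisy_ocr`. The cost is one validation run per candidate plus one for the base set.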

Agent Features

Memory

  • stores full trajectory (s0...sT) in structured context
  • uses short-term trajectory for next-step planning
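
One way to hold such a trajectory is a small structured record per step. The class and field names below are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List


@dataclass
class Step:
    action: str    # what the planner decided to do
    command: str   # the generated call, as text
    result: Any    # what the executor returned


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)

    def add(self, action: str, command: str, result: Any) -> None:
        self.steps.append(Step(action, command, result))

    def recent(self, k: int = 3) -> List[Step]:
        """Last k steps (k >= 1), e.g. as short-term context for the next plan."""
        return self.steps[-k:]

    def to_json(self) -> str:
        """Serialize the full trajectory for auditing or replay."""
        return json.dumps(asdict(self), default=str)


traj = Trajectory(query="What is (2 + 3) * 2?")
traj.add("add the numbers", "add(a=2, b=3)", 5)
traj.add("double the sum", "double(x=5)", 10)
```

The full list is kept for logging, while `recent` gives the planner a bounded window so prompts do not grow with every step.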

Planning

  • high-level plan generation (task decomposition)
  • low-level action prediction per step

Tool Use

  • standardized tool cards with metadata
  • separate command generator to turn actions into Python calls
  • context verifier to check completeness

Frameworks

  • supports easy plug-in of new tools without retraining

Is Agentic

true

Architectures

  • planner → command generator → executor
  • tool-card modular toolbox
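
The planner → command generator → executor flow can be sketched as a simple loop. In OctoTools the planner and command generator are LLM calls; here they are replaced by deterministic toy functions so the example runs on its own. All names are assumptions, and the command generator is simplified to return a tool name plus keyword arguments rather than Python source text.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]
History = List[Dict[str, Any]]


def run_agent(
    query: str,
    tools: Dict[str, Callable[..., Any]],
    plan_next: Callable[[str, History], Action],
    make_command: Callable[[Action], Tuple[str, Dict[str, Any]]],
    is_complete: Callable[[str, History], bool],
    max_steps: int = 10,
) -> History:
    history: History = []
    for _ in range(max_steps):
        action = plan_next(query, history)         # planner: choose the next action
        tool_name, kwargs = make_command(action)   # command generator: action -> call
        result = tools[tool_name](**kwargs)        # executor: run the call
        history.append({"action": action, "tool": tool_name, "result": result})
        if is_complete(query, history):            # verifier: stop once the context suffices
            break
    return history


# Toy stand-ins so the loop runs without any model.
TOOLS = {"add": lambda a, b: a + b, "double": lambda x: 2 * x}


def toy_plan(query: str, history: History) -> Action:
    if not history:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "double", "args": {"x": history[-1]["result"]}}


def toy_command(action: Action) -> Tuple[str, Dict[str, Any]]:
    return action["tool"], action["args"]


def toy_verify(query: str, history: History) -> bool:
    return len(history) >= 2


trace = run_agent("add 2 and 3, then double it", TOOLS, toy_plan, toy_command, toy_verify)
```

Because each stage is a separate function, a stage can be swapped (for example, a different planner prompt) without touching the others.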

Collaboration

  • single-agent workflow (can be extended to multi-agent later)

Optimization Features

Token Efficiency

  • planner and executor separation reduces repeated high-level reasoning tokens

Infra Optimization

  • time budget and step limits configurable (e.g., 300s / 10 steps used in evaluation)
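
A step and time limit of this kind can be enforced with a small guard object checked before each step. The `Budget` class below is a hypothetical helper; its defaults mirror the 300s / 10 steps quoted above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    max_steps: int = 10
    max_seconds: float = 300.0
    steps_used: int = 0
    started: float = field(default_factory=time.monotonic)

    def allows_another_step(self) -> bool:
        """True while both the step count and the wall-clock budget hold."""
        within_steps = self.steps_used < self.max_steps
        within_time = (time.monotonic() - self.started) < self.max_seconds
        return within_steps and within_time

    def record_step(self) -> None:
        self.steps_used += 1


budget = Budget(max_steps=2)
while budget.allows_another_step():
    budget.record_step()  # a real agent would run one plan/execute step here
```

`time.monotonic` is used instead of `time.time` so that system clock adjustments cannot extend or shorten the budget.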

System Optimization

  • executor runs generated commands in isolated Python environment
  • structured trajectory logging for replay and debugging
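
As a rough illustration of isolated execution, generated code can be run in a fresh interpreter process with a timeout. This is a sketch under stated assumptions, not the OctoTools executor: `run_isolated` is a hypothetical helper, and a separate process alone is not a security sandbox.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float = 10.0) -> dict:
    """Run code in a fresh Python process and capture its output."""
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode (ignores user site-packages
            # and PYTHON* environment variables).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


outcome = run_isolated("print(6 * 7)")
```

The returned dictionary can be appended to the trajectory log as-is, so a failed command leaves its stderr behind for debugging. For untrusted code, stronger isolation such as containers or restricted users would still be needed.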

Training Optimization

  • training-free: no model weight updates required

Inference Optimization

  • greedy toolset selection reduces unnecessary tool calls

Reproducibility

Data Urls

  • Datasets are public benchmarks listed in the paper (e.g., VQA 2.0, MedQA, MathVista).

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends heavily on the quality and correctness of plugged-in tools; bad tools can harm results.
  • Toolset optimizer is greedy and may miss globally optimal tool combinations.
  • Evaluation uses sampled validation/test subsets (100/200), so some benchmarks have limited statistical power.
  • System relies on external APIs and LLM costs; operational cost may be non-trivial.
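
To get a feel for the statistical-power limitation, the sketch below computes a normal-approximation standard error and 95% interval for an accuracy measured on n examples. The inputs used in the usage note are illustrative, not per-benchmark results from the paper.

```python
import math
from typing import Tuple


def accuracy_std_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate from n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def ci95(accuracy: float, n: int) -> Tuple[float, float]:
    """Approximate 95% confidence interval (normal approximation)."""
    half_width = 1.96 * accuracy_std_error(accuracy, n)
    return accuracy - half_width, accuracy + half_width
```

For an accuracy near 0.585 on 200 examples, the half-width comes out at about 0.068, so differences of a few points on a single benchmark may fall within sampling noise; averages over many benchmarks are tighter.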

When Not To Use

  • When tool quality is unknown or untrusted and you cannot validate tool outputs.
  • When low-latency, tiny-footprint inference is required (LLM+tool orchestration adds latency and cost).
  • When tasks are simple single-step answers where a base LLM already performs well.

Failure Modes

  • Planner suggests inappropriate tools because metadata is incomplete or ambiguous.
  • Command generator creates invalid or unsafe code; executor must sandbox execution.
  • Cascading errors: mistaken tool outputs propagate through subsequent steps and produce confident but wrong final answers.
  • Greedy tool selection can admit tools that each help slightly on their own but collectively add noise on some tasks.
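
One partial mitigation for invalid or unsafe generated commands is a static check before execution. The sketch below uses Python's `ast` module to reject syntax errors, imports, and calls to anything outside an allow-list. `check_command` is a hypothetical helper, and this kind of filter complements sandboxing rather than replacing it.

```python
import ast
from typing import List, Set


def check_command(code: str, allowed_calls: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the command passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("imports are not allowed")
        elif isinstance(node, ast.Call):
            func = node.func
            # Only plain names like search(...) can match the allow-list;
            # attribute calls such as os.system(...) are always rejected.
            name = func.id if isinstance(func, ast.Name) else None
            if name not in allowed_calls:
                problems.append(f"call not allowed: {ast.unparse(func)}")
    return problems
```

Usage: `check_command("result = search(query='x')", {"search"})` returns an empty list, while `check_command("os.system('ls')", {"search"})` returns `["call not allowed: os.system"]`. `ast.unparse` requires Python 3.9 or later.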

Core Entities

Models

  • gpt-4o-2024-08-06
  • gpt-4o-mini

Metrics

  • Accuracy

Datasets

  • AlgoPuzzleVQA
  • Hallusion-VD
  • PuzzleVQA
  • VQA 2.0
  • Game of 24
  • Omni-MATH
  • CLEVR-Math
  • MathVista
  • GPQA
  • MMLU-Pro
  • SciFIBench
  • MedQA
  • PathCLS
  • PathVQA
  • SLAKE
  • GAIA-Text

Benchmarks

  • 16-benchmark suite (vision, math, scientific, medical, agentic)