Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
If your product depends on reliable multi-step interactions with many third-party APIs (search, image tools, web services), pairing a small router with a summarizing state manager can raise task success and cut repeated failures with little engineering overhead.
Summary TLDR
Sum2Act is a prompt-driven pipeline that makes an LLM (tested with ChatGPT) pick and call open-world APIs repeatedly while keeping a short, high-density task State (summary + failure history). A Router proposes actions (which API to call or 'Finish') and a State Manager summarizes each API response, records successes/failures, and guides the next step. On the ToolBench benchmark (16k+ real APIs) Sum2Act improves pass rate over ReAct and DFSDT and extends naturally to vision APIs.
Problem Statement
Calling many real-world APIs reliably is hard for LLMs because long raw logs overload context, failed API calls cause error propagation, and tree-search methods can miss useful info from other branches. The paper aims to give LLMs a compact, evolving task state and a two-module pipeline so they can plan, avoid repeating failures, and handle dynamic API responses.
Main Contribution
A two-part pipeline (Router + State Manager) that forces the LLM to summarize results after every API call and to keep a short, dense State with current results and failure history.
An action-proposal loop where the Router picks an API or 'Finish' and the State Manager validates outcomes and records failures so future choices avoid bad tools.
Empirical evaluation on ToolBench (16k+ real APIs) showing improved Pass Rate and Win Rate versus ReAct and DFSDT, plus demonstration of integrating visual APIs (SDXL, ControlNet, BLIP, InstructPix2Pix).
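To make the "short, dense State" idea concrete, here is a minimal sketch of what such a state object could look like. The class name `TaskState`, its fields, and the API names are hypothetical and do not come from the paper. In the described pipeline the summary is written by an LLM; this sketch simply appends a note so that it runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Hypothetical compact task state: a short summary plus a failure history."""
    summary: str = ""
    failures: dict[str, str] = field(default_factory=dict)  # API name -> failure reason

    def record(self, api: str, ok: bool, note: str) -> None:
        """Fold one API outcome into the state instead of keeping the raw response."""
        if ok:
            # Assumption: a real system would ask the LLM to rewrite the summary.
            self.summary = f"{self.summary} {note}".strip()
        else:
            self.failures[api] = note

    def usable(self, candidates: list[str]) -> list[str]:
        """Return candidate APIs that have not already failed on this task."""
        return [api for api in candidates if api not in self.failures]


state = TaskState()
state.record("weather_v1", ok=False, note="HTTP 401: key rejected")
state.record("weather_v2", ok=True, note="Paris: 18 C, light rain.")
print(state.usable(["weather_v1", "weather_v2", "geo_lookup"]))
# prints ['weather_v2', 'geo_lookup']
```

Keeping failures keyed by API name is one simple way to let the next action proposal skip tools that have already failed.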
Key Findings
Sum2Act raises the average Pass Rate on ToolBench to 70.0% using ChatGPT.
Sum2Act achieves a higher Win Rate than the ReAct and DFSDT baselines in pairwise comparisons.
Adding a task-decomposition step yields only small improvements.
Sum2Act extends to image tasks by integrating visual APIs (SDXL, ControlNet, BLIP, InstructPix2Pix).
Results
Pass Rate (average across test splits)
Win Rate (pairwise)
Effect of Task Decomposition
Who Should Care
What To Try In 7 Days
Wrap your LLM calls in a loop: the Router suggests an API or 'Finish'; if it picks an API, call it and record the raw result.
After each API call, have the LLM produce a short State: the current results plus the failure reason when a call fails. Keep the State short and information-dense.
Run end-to-end tests with an oracle or curated set of correct APIs to measure pass/win rates before integrating a retriever.
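The loop described in these steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `router` and `state_manager` are rule-based stubs standing in for LLM prompts, and the two tools (`flaky_search`, `backup_search`) are made-up functions used only to show one failure followed by a recovery.

```python
from typing import Callable


# Hypothetical stand-ins for real APIs; each takes a query and returns raw text.
def flaky_search(query: str) -> str:
    raise RuntimeError("503 service unavailable")


def backup_search(query: str) -> str:
    return f"Top result for '{query}': Sum2Act paper page"


TOOLS: dict[str, Callable[[str], str]] = {
    "flaky_search": flaky_search,
    "backup_search": backup_search,
}


def router(task: str, state: dict) -> str:
    """Stub Router (an LLM prompt in practice): pick the first tool with no
    recorded failure, or 'Finish' once the state holds a result."""
    if state["summary"]:
        return "Finish"
    for name in TOOLS:
        if name not in state["failures"]:
            return name
    return "Finish"


def state_manager(state: dict, action: str, ok: bool, raw: str) -> dict:
    """Stub State Manager (an LLM summarization call in practice): truncate
    the raw response and file it as a result or as a failure reason."""
    if ok:
        state["summary"] = raw[:80]
    else:
        state["failures"][action] = raw[:80]
    return state


def run(task: str, max_steps: int = 5) -> dict:
    state = {"summary": "", "failures": {}}
    for _ in range(max_steps):
        action = router(task, state)
        if action == "Finish":
            break
        try:
            raw, ok = TOOLS[action](task), True
        except Exception as exc:  # any API error becomes a recorded failure
            raw, ok = str(exc), False
        state = state_manager(state, action, ok, raw)
    return state


final_state = run("find the Sum2Act paper")
print(final_state["failures"])  # {'flaky_search': '503 service unavailable'}
print(final_state["summary"])
```

The `max_steps` cap is an added safeguard so that a Router that never proposes 'Finish' cannot loop forever.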
Agent Features
Memory
- State summarization (short, high-density summary of observations)
- Failure history to avoid repeating failed tools
Planning
- Router action proposal (choose tool or Finish)
- Iterative plan refinement using summarized State
Tool Use
- Open-world API invocation (16k+ APIs)
- Visual API integration for image tasks
Frameworks
- Sum2Act
Is Agentic
true
Architectures
- Router + State Manager pipeline
Optimization Features
Token Efficiency
- State summarization reduces context length vs raw memory
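A rough way to see why a summarized State keeps context short: raw memory grows with every API call, while a capped state stays bounded. The sketch below is illustrative only. `rough_tokens` counts whitespace-separated words rather than real tokens, `update_state` is a placeholder for an LLM summarization call, and the API responses are synthetic.

```python
def rough_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def update_state(state: str, raw: str, keep: int = 4, cap: int = 12) -> str:
    """Placeholder for an LLM summary: keep the first `keep` words of the
    new response and cap the whole state at the most recent `cap` words."""
    words = state.split() + raw.split()[:keep]
    return " ".join(words[-cap:])


# Four synthetic, verbose API responses of 50 words each.
responses = [" ".join(f"field{i}_{j}" for j in range(50)) for i in range(4)]

raw_context = ""
state = ""
for raw in responses:
    raw_context += raw + "\n"          # raw memory: grows with every call
    state = update_state(state, raw)   # summarized state: stays bounded

print(rough_tokens(raw_context), rough_tokens(state))  # prints: 200 12
```

In a real deployment you would measure this with the tokenizer of the model you call, since word counts only approximate token counts.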
Reproducibility
Data Urls
- ToolBench (ToolLLM benchmark; 16,000+ real APIs from RapidAPI Hub) cited in paper
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments use an oracle API retriever, so practical gains depend on retriever quality in real deployments.
- Results reported with a single LLM (ChatGPT); behavior may differ with other models.
- APIs are dynamic, so reruns can change results; the paper re-ran the baselines because of that variability.
When Not To Use
- When you cannot access a reliable API retriever or ground-truth tool index.
- When strict low-latency constraints rule out multiple LLM calls per step.
- When API responses are uniformly noisy and summarization cannot extract task-relevant signals.
Failure Modes
- If the retriever provides wrong tools, the Router will follow bad paths even with summarization.
- State summaries may miss critical details if prompts are weak, causing wrong next actions.
- Dynamic API outputs can cause non-deterministic behavior, so Pass Rate and Win Rate need to be re-evaluated over time.
Core Entities
Models
- ChatGPT
Metrics
- Pass Rate
- Win Rate
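This summary does not define the two metrics, so the sketch below assumes the usual reading: Pass Rate is the fraction of tasks completed successfully, and Win Rate is the fraction of pairwise comparisons in which a judge prefers one method's answer. The function names and the sample data are hypothetical and are not results from the paper.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks marked as solved (assumed definition)."""
    return sum(outcomes) / len(outcomes)


def win_rate(preferences: list[str], method: str) -> float:
    """Fraction of pairwise comparisons won by `method` (assumed definition;
    ties are not modeled here)."""
    return sum(p == method for p in preferences) / len(preferences)


# Made-up evaluation records, for illustration only.
outcomes = [True, True, False, True]
preferences = ["sum2act", "baseline", "sum2act", "sum2act", "baseline"]

print(pass_rate(outcomes))               # 0.75
print(win_rate(preferences, "sum2act"))  # 0.6
```

Because API outputs change over time, it may be worth storing the raw per-task records so that both rates can be recomputed after each rerun.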
Datasets
- ToolBench (ToolLLM benchmark, 16k+ real APIs)
Benchmarks
- ToolBench

