Overview
The idea is simple and practical: summarize after each API call to keep the context short and to track failures. It is evaluated on a large real-API benchmark, but the experiments use an oracle retriever and only one LLM (ChatGPT).
Citations: 3
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router plus a summarizing state manager can boost task success and reduce repeated failures with little engineering overhead.
Who Should Care
Summary TLDR
Sum2Act is a prompt-driven pipeline that makes an LLM (tested with ChatGPT) pick and call open-world APIs repeatedly while keeping a short, high-density task State (summary + failure history). A Router proposes actions (which API to call, or 'Finish'), and a State Manager summarizes each API response, records successes and failures, and guides the next step. On the ToolBench benchmark (16k+ real APIs), Sum2Act reaches a 70.0% average pass rate, ahead of DFSDT (67.0%) and ReAct (41.1%), and extends naturally to vision APIs.
Problem Statement
Calling many real-world APIs reliably is hard for LLMs because long raw logs overload context, failed API calls cause error propagation, and tree-search methods can miss useful info from other branches. The paper aims to give LLMs a compact, evolving task state and a two-module pipeline so they can plan, avoid repeating failures, and handle dynamic API responses.
Main Contribution
A two-part pipeline (Router + State Manager) that forces the LLM to summarize results after every API call and to keep a short, dense State with current results and failure history.
An action-proposal loop where the Router picks an API or 'Finish' and the State Manager validates outcomes and records failures so future choices avoid bad tools.
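The two contributions above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: the `State` fields, the `run_task` signature, and the toy router and state manager (stand-ins for the two LLM prompts) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class State:
    summary: str = ""                                   # what has been achieved so far
    failures: List[str] = field(default_factory=list)   # "tool: reason" entries


Action = Tuple[str, str]  # (tool name, argument); the name "Finish" ends the loop


def run_task(task: str,
             tools: Dict[str, Callable[[str], str]],
             router: Callable[[str, State, List[str]], Action],
             state_manager: Callable[[str, State, Action, str, bool], State],
             max_steps: int = 8) -> State:
    """Alternate between Router (pick an action) and State Manager (summarize)."""
    state = State()
    for _ in range(max_steps):
        name, arg = router(task, state, sorted(tools))
        if name == "Finish":
            break
        try:
            raw, ok = tools[name](arg), True
        except Exception as exc:          # a failed call becomes failure history
            raw, ok = str(exc), False
        state = state_manager(task, state, (name, arg), raw, ok)
    return state


# Toy stand-ins for the two LLM prompts (hypothetical, for illustration only).
def toy_router(task: str, state: State, tool_names: List[str]) -> Action:
    if state.summary:                     # something useful was found: stop
        return ("Finish", "")
    failed = {entry.split(":", 1)[0] for entry in state.failures}
    for name in tool_names:
        if name not in failed:            # skip tools that already failed
            return (name, task)
    return ("Finish", "")


def toy_state_manager(task: str, state: State, action: Action,
                      raw: str, ok: bool) -> State:
    name, _ = action
    if ok:
        return State(summary=f"{name} returned: {raw[:80]}", failures=state.failures)
    return State(summary=state.summary,
                 failures=state.failures + [f"{name}: {raw[:80]}"])


def broken_search(query: str) -> str:
    raise RuntimeError("503 service unavailable")


def working_search(query: str) -> str:
    return f"3 results for '{query}'"


final = run_task("weather in Paris",
                 {"a_broken_search": broken_search, "b_working_search": working_search},
                 toy_router, toy_state_manager)
print(final.summary)
print(final.failures)
```

In a real setup both toy functions would be replaced by LLM calls; the point of the sketch is only that the raw API output never accumulates in the context, while the failure list steers the next choice away from tools that already failed.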
Key Findings
Sum2Act raises the average Pass Rate on ToolBench to 70.0% using ChatGPT, versus 67.0% for DFSDT and 41.1% for ReAct.
Sum2Act wins pairwise comparisons against both baselines: 67.8% win rate vs ReAct and 54.6% vs DFSDT.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass Rate (average across test splits) | 70.0% | DFSDT 67.0%; ReAct 41.1% | +3.0 pp vs DFSDT; +28.9 pp vs ReAct | ToolBench (6 test subsets averaged) | Table 1 reports per-split and average pass rates | Table 1 |
| Win Rate (pairwise) | 67.8% vs ReAct; 54.6% vs DFSDT | ReAct; DFSDT | Sum2Act beats ReAct and DFSDT in pairwise wins on average (17.8 pp and 4.6 pp above the 50% parity line) | ToolBench (pairwise comparisons with evaluator) | Table 2 shows pairwise win rates; ties split per ToolLLM protocol | Table 2 |
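The deltas in the table are plain percentage-point differences, and they can be checked in a few lines. The `win_rate` helper below is a sketch that assumes "ties split" means each tie counts as half a win for either side; that reading, the function names, and the example counts are assumptions for illustration, not figures from the paper.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(value - baseline, 1)


def win_rate(wins: int, losses: int, ties: int) -> float:
    """Pairwise win rate in percent, counting each tie as half a win (assumption)."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons to score")
    return 100.0 * (wins + 0.5 * ties) / total


# Pass-rate deltas reported in the table.
print(delta_pp(70.0, 67.0))   # vs DFSDT
print(delta_pp(70.0, 41.1))   # vs ReAct

# Hypothetical counts, only to show how ties enter the win rate.
print(win_rate(wins=6, losses=3, ties=1))
```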
What To Try In 7 Days
Wrap your LLM calls in a loop: Router suggests an API or 'Finish', then call the API and record the raw result.
After each call, call the LLM to produce a short State: current results + failure reason when relevant; keep the state short (high information density).
Run end-to-end tests with an oracle or curated set of correct APIs to measure pass/win rates before integrating a retriever.
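For the last step, a small harness is enough to get a pass rate over a curated task set before a retriever is involved. The sketch below assumes each task carries its own hand-picked tool list (standing in for the oracle retriever); the task fields, `pass_rate`, and the toy solver and checker are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

Task = Dict[str, object]  # keys assumed here: "query", "oracle_tools", "expected"


def pass_rate(tasks: List[Task],
              solve: Callable[[str, List[str]], str],
              passed: Callable[[Task, str], bool]) -> float:
    """Percentage of tasks whose final answer is judged as passing."""
    if not tasks:
        raise ValueError("need at least one task")
    hits = 0
    for task in tasks:
        # Each task supplies its own curated tools, so retrieval is not tested.
        answer = solve(str(task["query"]), list(task["oracle_tools"]))
        if passed(task, answer):
            hits += 1
    return 100.0 * hits / len(tasks)


# Toy solver and checker; swap in the real agent loop and an evaluator.
def toy_solve(query: str, tools: List[str]) -> str:
    return f"answered with {tools[0]}" if tools else "gave up"


def toy_passed(task: Task, answer: str) -> bool:
    return str(task["expected"]) in answer


tasks = [
    {"query": "convert 3 km to miles", "oracle_tools": ["unit_converter"], "expected": "unit_converter"},
    {"query": "find cat images", "oracle_tools": ["image_search"], "expected": "image_search"},
    {"query": "look up a stock price", "oracle_tools": ["stock_quote"], "expected": "stock_quote"},
    {"query": "translate hello", "oracle_tools": [], "expected": "translator"},
]
rate = pass_rate(tasks, toy_solve, toy_passed)
print(f"pass rate: {rate:.1f}%")
```

Keeping `solve` and `passed` as injected callables means the same harness can later be rerun with a real retriever in place of the curated tool lists, which makes the drop caused by retrieval errors directly measurable.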
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Experiments use the oracle API retriever, so practical gains depend on retriever quality in real deployments.
Results reported with a single LLM (ChatGPT); behavior may differ with other models.
When Not To Use
When you cannot access a reliable API retriever or ground-truth tool index.
When strict low-latency constraints rule out multiple LLM calls per step.
Failure Modes
If the retriever provides wrong tools, the Router will follow bad paths even with summarization.
State summaries may miss critical details if prompts are weak, causing wrong next actions.

