Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

February 28, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

The idea is simple and practical: summarize after each API call to keep the context short and to track failures. It is evaluated on a large real-API benchmark, but the experiments use only an oracle retriever and ChatGPT.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Yulong Liu, Yunlong Yuan, Chunwei Wang, Jianhua Han, Yongqiang Ma, Li Zhang, Nanning Zheng, Hang Xu

Links

Abstract / PDF / Data

Why It Matters For Business

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router + summarizing state manager can boost success and reduce repeated failures with little engineering overhead.

Who Should Care

Summary TLDR

Sum2Act is a prompt-driven pipeline that makes an LLM (tested with ChatGPT) pick and call open-world APIs repeatedly while keeping a short, high-density task State (summary + failure history). A Router proposes actions (which API to call or 'Finish') and a State Manager summarizes each API response, records successes/failures, and guides the next step. On the ToolBench benchmark (16k+ real APIs) Sum2Act improves pass rate over ReAct and DFSDT and extends naturally to vision APIs.

Problem Statement

Calling many real-world APIs reliably is hard for LLMs because long raw logs overload context, failed API calls cause error propagation, and tree-search methods can miss useful info from other branches. The paper aims to give LLMs a compact, evolving task state and a two-module pipeline so they can plan, avoid repeating failures, and handle dynamic API responses.

Main Contribution

A two-part pipeline (Router + State Manager) that forces the LLM to summarize results after every API call and to keep a short, dense State with current results and failure history.

An action-proposal loop where the Router picks an API or 'Finish' and the State Manager validates outcomes and records failures so future choices avoid bad tools.
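The two contributions above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `State`, `run_pipeline`, `toy_router`, `toy_api`, and `toy_manager` are hypothetical, the toy functions stand in for LLM prompts and real API calls, and hard-filtering failed tools is a simplification of letting the failure history guide the Router.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

FINISH = "Finish"

@dataclass
class State:
    """Short task state: a running summary plus a record of failed tools."""
    summary: str = ""
    failures: Dict[str, str] = field(default_factory=dict)  # tool name -> failure reason

# Hypothetical signatures; in practice each would wrap an LLM prompt or an HTTP call.
Router = Callable[[str, State, List[str]], str]          # -> tool name or FINISH
ApiCaller = Callable[[str], Tuple[bool, str]]            # tool name -> (ok, raw response)
StateManager = Callable[[State, str, bool, str], State]  # -> updated state

def run_pipeline(task: str, tools: List[str], router: Router,
                 call_api: ApiCaller, manage_state: StateManager,
                 max_steps: int = 8) -> State:
    state = State()
    for _ in range(max_steps):
        # Offer only tools with no recorded failure (simplifying assumption).
        candidates = [t for t in tools if t not in state.failures]
        action = router(task, state, candidates)
        # Stop on 'Finish' or on a proposal outside the candidate list.
        if action == FINISH or action not in candidates:
            break
        ok, raw = call_api(action)
        # Summarize after every call instead of keeping the raw response.
        state = manage_state(state, action, ok, raw)
    return state

def toy_router(task, state, candidates):
    # Stand-in for the Router prompt: finish once a summary exists.
    if state.summary or not candidates:
        return FINISH
    return candidates[0]

def toy_api(tool):
    # Stand-in for a real API: "flaky_search" always fails.
    if tool == "flaky_search":
        return False, "HTTP 500"
    return True, f"{tool} returned a long raw payload"

def toy_manager(state, tool, ok, raw):
    # Stand-in for the State Manager prompt: record a failure or a short summary.
    if not ok:
        return State(state.summary, {**state.failures, tool: raw})
    return State(f"{tool}: got results", dict(state.failures))

final = run_pipeline("find papers", ["flaky_search", "backup_search"],
                     toy_router, toy_api, toy_manager)
print(final.summary)   # backup_search: got results
print(final.failures)  # {'flaky_search': 'HTTP 500'}
```

In this toy run the failed tool is recorded once and never offered again, which is the behaviour the failure history is meant to encourage.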

Key Findings

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%

Practical Use: If you need more successful end-to-end API-driven solutions, adding a summarizing state manager and router logic gave a 3.0 pp pass-rate gain over DFSDT and a 28.9 pp gain over ReAct on ToolBench.

Evidence Ref: Table 1
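The gaps behind this finding are plain percentage-point differences. A small worked check, using only the three averages reported above; the helper names `pp_delta` and `relative_gain` are illustrative, and the relative figures are derived here rather than reported in the source:

```python
def pp_delta(ours: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)

def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)

sum2act, dfsdt, react = 70.0, 67.0, 41.1  # average Pass Rate (%)

print(pp_delta(sum2act, dfsdt))       # 3.0
print(pp_delta(sum2act, react))       # 28.9
print(relative_gain(sum2act, dfsdt))  # 4.5
print(relative_gain(sum2act, react))  # 70.3
```

Keeping the two measures apart avoids reading a 3.0 pp gain as a 3% relative improvement.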

Sum2Act wins in pairwise comparisons more often than baselines

Numbers: Win Rate avg: Sum2Act vs ReAct 67.8%; Sum2Act vs DFSDT 54.6%

Practical Use: Using state summarization improves not just raw success but overall solution quality and efficiency compared against other prompting methods on evaluated tasks.

Evidence Ref: Table 2
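A pairwise win rate with split ties can be computed as below. This is a sketch that assumes "split" means each tie counts as half a win for either side; the exact ToolLLM evaluator protocol is not reproduced here, and the counts in the example are made up for illustration.

```python
def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate in percent, counting each tie as half a win (assumption)."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons to score")
    return 100.0 * (wins + 0.5 * ties) / total

# Hypothetical counts, not taken from the paper.
print(win_rate(wins=60, losses=30, ties=10))  # 65.0
```

Under this convention a value above 50% means the method is preferred more often than not, which is how the 54.6% figure against DFSDT should be read.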

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Pass Rate (average across test splits) | 70.0% | DFSDT 67.0%; ReAct 41.1% | +3.0 pp vs DFSDT; +28.9 pp vs ReAct | ToolBench (6 test subsets averaged) | Table 1 reports per-split and average pass rates | Table 1 |
| Win Rate (pairwise) | 67.8% vs ReAct; 54.6% vs DFSDT | ReAct; DFSDT | Sum2Act beats ReAct and DFSDT in pairwise wins on average | ToolBench (pairwise comparisons with evaluator) | Table 2 shows pairwise win rates; ties split per ToolLLM protocol | Table 2 |

What To Try In 7 Days

Wrap your LLM calls in a loop: Router suggests an API or 'Finish', then call the API and record the raw result.

After each call, call the LLM to produce a short State: current results + failure reason when relevant; keep the state short (high information density).

Run end-to-end tests with an oracle or curated set of correct APIs to measure pass/win rates before integrating a retriever.
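For the third step, a tiny harness is enough to get a first pass-rate number. The sketch below assumes each test case is a (query, curated tool list) pair and that a hypothetical `solve` callable runs your pipeline and returns True on success; ToolBench itself judges success with an LLM evaluator, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Iterable, List, Tuple

Task = Tuple[str, List[str]]  # (user query, curated "oracle" tool names)

def pass_rate(tasks: Iterable[Task], solve: Callable[[str, List[str]], bool]) -> float:
    """Percentage of tasks for which solve(query, tools) reports success."""
    tasks = list(tasks)
    if not tasks:
        raise ValueError("no tasks to evaluate")
    passed = sum(1 for query, tools in tasks if solve(query, tools))
    return 100.0 * passed / len(tasks)

def toy_solve(query: str, tools: List[str]) -> bool:
    # Stand-in for a full pipeline run: succeeds when any tool besides "broken_api" is available.
    return any(t != "broken_api" for t in tools)

# Hypothetical test cases, for illustration only.
cases = [
    ("weather in Oslo", ["weather_api"]),
    ("convert 10 USD to EUR", ["fx_api", "broken_api"]),
    ("resize this image", ["broken_api"]),
    ("latest headlines", ["news_api"]),
]
print(pass_rate(cases, toy_solve))  # 75.0
```

Running the same cases later with a real retriever in place of the curated tool lists shows how much of the measured pass rate depends on tool selection.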

Agent Features

Memory
State summarization (short, high-density summary of observations)
Failure history to avoid repeating failed tools
Planning
Router action proposal (choose tool or Finish)
Iterative plan refinement using summarized State
Tool Use
Open-world API invocation (16k+ APIs)
Visual API integration for image tasks
Frameworks
Sum2Act
Is Agentic

Yes

Architectures
Router + State Manager pipeline

Optimization Features

Token Efficiency
State summarization reduces context length vs raw memory
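A rough way to see the effect is to compare the size of a prompt built from raw responses with one built from a summarized State. The sketch below uses whitespace word counts as a crude stand-in for tokens (a real tokenizer would give different numbers), and every string in it is invented for illustration.

```python
def approx_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())

task = "Find three recent papers on tool-using agents."

# Hypothetical raw API responses, as they would accumulate in a raw-log memory.
raw_log = [
    "search_api -> " + " ".join(f"result_{i} title_{i} url_{i}" for i in range(40)),
    "details_api -> ERROR 500 " + " ".join(["stack_frame"] * 30),
]

# Hypothetical State a summarizing step might produce from the same two calls.
state_summary = "search_api returned 40 hits; top 3 selected."
failures = {"details_api": "server error (500)"}

raw_context = "\n".join([task, *raw_log])
state_context = "\n".join(
    [task, "Summary: " + state_summary]
    + [f"Failed: {tool} ({reason})" for tool, reason in failures.items()]
)

raw_len = approx_tokens(raw_context)
state_len = approx_tokens(state_context)
print(f"raw log: ~{raw_len} words, summarized state: ~{state_len} words")
```

The saving comes at the cost of one extra summarization call per step, so it pays off mainly when raw responses are long or the task takes many steps.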

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

ToolBench (ToolLLM benchmark; 16,000+ real APIs from RapidAPI Hub), as cited in the paper

Risks & Boundaries

Limitations

Experiments use the oracle API retriever, so practical gains depend on retriever quality in real deployments.

Results reported with a single LLM (ChatGPT); behavior may differ with other models.

When Not To Use

When you cannot access a reliable API retriever or ground-truth tool index.

When strict low-latency constraints rule out multiple LLM calls per step.

Failure Modes

If the retriever provides wrong tools, the Router will follow bad paths even with summarization.

State summaries may miss critical details if prompts are weak, causing wrong next actions.

Core Entities

Models

ChatGPT

Metrics

Pass Rate
Win Rate

Datasets

ToolBench (ToolLLM benchmark, 16k+ real APIs)

Benchmarks

ToolBench