Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

Overview

Decision Snapshot: Needs Validation

The idea is simple and practical: have the LLM summarize after each API call to keep the context short and to track failures. The method is evaluated on a large real-API benchmark, but the experiments use an oracle retriever and only ChatGPT.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Yulong Liu, Yunlong Yuan, Chunwei Wang, Jianhua Han, Yongqiang Ma, Li Zhang, Nanning Zheng, Hang Xu

Links

Abstract / PDF / Data

Why It Matters For Business

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router + summarizing state manager can boost success and reduce repeated failures with little engineering overhead.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

Sum2Act is a prompt-driven pipeline that makes an LLM (tested with ChatGPT) pick and call open-world APIs repeatedly while keeping a short, high-density task State (summary + failure history). A Router proposes actions (which API to call or 'Finish') and a State Manager summarizes each API response, records successes/failures, and guides the next step. On the ToolBench benchmark (16k+ real APIs) Sum2Act improves pass rate over ReAct and DFSDT and extends naturally to vision APIs.

Problem Statement

Calling many real-world APIs reliably is hard for LLMs because long raw logs overload context, failed API calls cause error propagation, and tree-search methods can miss useful info from other branches. The paper aims to give LLMs a compact, evolving task state and a two-module pipeline so they can plan, avoid repeating failures, and handle dynamic API responses.

Main Contribution

A two-part pipeline (Router + State Manager) that forces the LLM to summarize results after every API call and to keep a short, dense State with current results and failure history.

An action-proposal loop where the Router picks an API or 'Finish' and the State Manager validates outcomes and records failures so future choices avoid bad tools.

Key Findings

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%

Practical Use: If you need more successful end-to-end API-driven solutions, adding a summarizing state manager and router logic gave a 3.0 pp pass-rate gain over DFSDT and a much larger gain (28.9 pp) over ReAct on ToolBench.

Evidence Ref: Table 1

Sum2Act wins in pairwise comparisons more often than baselines

Numbers: Win Rate avg: Sum2Act vs ReAct 67.8%; Sum2Act vs DFSDT 54.6%

Practical Use: Using state summarization improves not just raw success but overall solution quality and efficiency compared against other prompting methods on evaluated tasks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass Rate (average across test splits)	70.0%	DFSDT 67.0%; ReAct 41.1%	+3.0 pp vs DFSDT; +28.9 pp vs ReAct	ToolBench (6 test subsets averaged)	Table 1 reports per-split and average pass rates	Table 1
Win Rate (pairwise)	67.8% vs ReAct; 54.6% vs DFSDT	ReAct; DFSDT	Sum2Act beats ReAct and DFSDT in pairwise wins on average	ToolBench (pairwise comparisons with evaluator)	Table 2 shows pairwise win rates; ties split per ToolLLM protocol	Table 2
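
The deltas in the table are plain percentage-point differences, and the win rate is a share of pairwise judgements. The sketch below recomputes the reported deltas and shows one way to score pairwise outcomes. Counting a tie as half a win is an assumption about what "ties split" means here, and the function names are illustrative rather than taken from the paper or from ToolBench.

```python
def pass_rate(outcomes):
    """Percentage of tasks solved; `outcomes` is a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


def win_rate(judgements):
    """Percentage of pairwise comparisons won.

    Assumption: a tie is split evenly, so it counts as half a win.
    """
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(score[j] for j in judgements) / len(judgements)


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


# Average pass rates as reported in the Results table.
sum2act, dfsdt, react = 70.0, 67.0, 41.1
print(delta_pp(sum2act, dfsdt))  # 3.0
print(delta_pp(sum2act, react))  # 28.9
```

The same helpers can be pointed at your own evaluation logs once you have per-task pass/fail flags and per-pair judgements.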

What To Try In 7 Days

Wrap your LLM calls in a loop: Router suggests an API or 'Finish', then call the API and record the raw result.

After each call, call the LLM to produce a short State: current results + failure reason when relevant; keep the state short (high information density).

Run end-to-end tests with an oracle or curated set of correct APIs to measure pass/win rates before integrating a retriever.
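
The steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_task`, `toy_router`, `toy_summarize`, and the two fake tools are hypothetical names, and the router and summarizer are plain functions standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    summary: str = ""                                   # current results, kept short
    failures: List[str] = field(default_factory=list)   # "tool: reason" entries


def run_task(task, tools, router, summarize, max_steps=8):
    """Router proposes a tool or 'Finish'; the state is updated after every call."""
    state = State()
    for _ in range(max_steps):
        action = router(task, state, list(tools))
        if action == "Finish":
            break
        try:
            raw = tools[action](task)
            state.summary = summarize(task, state.summary, raw)
        except Exception as exc:
            # Record the failure so later proposals can avoid this tool.
            state.failures.append(f"{action}: {exc}")
    return state


# --- Hypothetical stand-ins for the LLM-backed Router and State Manager ---
def toy_router(task, state, tool_names):
    failed = {entry.split(":")[0] for entry in state.failures}
    if state.summary:
        return "Finish"
    for name in tool_names:
        if name not in failed:
            return name
    return "Finish"


def toy_summarize(task, previous, raw):
    return (previous + " " + raw).strip()[:200]


def broken_search(query):
    raise RuntimeError("HTTP 500")


def backup_search(query):
    return f"3 results for '{query}'"


tools = {"broken_search": broken_search, "backup_search": backup_search}
final = run_task("weather in Paris", tools, toy_router, toy_summarize)
print(final.summary)
print(final.failures)
```

In a real setup the two toy functions would be replaced by prompts to your LLM, and `tools` would come from an oracle or curated tool list as suggested in the third step.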

Agent Features

Memory

State summarization (short, high-density summary of observations)
Failure history to avoid repeating failed tools
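
One possible shape for this memory is a small record holding the current summary plus a list of failed tools with reasons. The sketch below is an assumption about how such a state could be stored and rendered into a prompt; `TaskState` and its methods are hypothetical names, not an interface described in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaskState:
    summary: str = ""
    failures: List[Tuple[str, str]] = field(default_factory=list)  # (tool, reason)

    def record_failure(self, tool, reason):
        """Store each (tool, reason) pair once to keep the state dense."""
        if (tool, reason) not in self.failures:
            self.failures.append((tool, reason))

    def untried(self, candidates):
        """Candidate tools that have not failed so far."""
        failed = {tool for tool, _ in self.failures}
        return [c for c in candidates if c not in failed]

    def to_prompt(self):
        """Render the state as short text to prepend to the next LLM call."""
        lines = [f"Current results: {self.summary or '(none yet)'}"]
        if self.failures:
            lines.append("Failed tools:")
            lines += [f"- {tool}: {reason}" for tool, reason in self.failures]
        return "\n".join(lines)


state = TaskState(summary="Found 2 hotels in Rome")
state.record_failure("price_api", "timeout")
state.record_failure("price_api", "timeout")  # duplicate is ignored
print(state.untried(["price_api", "rates_api"]))  # ['rates_api']
print(state.to_prompt())
```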

Planning

Router action proposal (choose tool or Finish)
Iterative plan refinement using summarized State

Tool Use

Open-world API invocation (16k+ APIs)
Visual API integration for image tasks

Frameworks

Sum2Act

Is Agentic

Yes

Architectures

Router + State Manager pipeline

Optimization Features

Token Efficiency

State summarization reduces context length vs raw memory
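
To make the context-length claim concrete, the sketch below compares a raw memory that appends every API response with a state that is folded down to a fixed budget after each call. It is only an illustration: characters stand in for tokens, taking the first sentence stands in for an LLM summary, and the function names and the 200-character budget are assumptions.

```python
def raw_memory(responses):
    """Baseline: every raw API response is appended to the context."""
    return "\n".join(responses)


def summarized_state(responses, budget=200):
    """Fold each response into a state that never exceeds `budget` characters."""
    state = ""
    for response in responses:
        headline = response.split(".")[0]       # crude stand-in for an LLM summary
        state = (state + " | " + headline).strip(" |")
        state = state[-budget:]                 # keep only the most recent content
    return state


# Synthetic responses: one informative sentence followed by bulky filler.
responses = [f"Call {i} returned item {i}. " + "padding " * 50 for i in range(10)]
print(len(raw_memory(responses)), len(summarized_state(responses)))
```

The raw memory grows with every call, while the folded state stays within the budget regardless of how many calls are made.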

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

ToolBench (ToolLLM benchmark; 16,000+ real APIs from RapidAPI Hub) cited in paper

Risks & Boundaries

Limitations

Experiments use the oracle API retriever, so practical gains depend on retriever quality in real deployments.

Results reported with a single LLM (ChatGPT); behavior may differ with other models.

When Not To Use

When you cannot access a reliable API retriever or ground-truth tool index.

When strict low-latency constraints rule out multiple LLM calls per step.

Failure Modes

If the retriever provides wrong tools, the Router will follow bad paths even with summarization.

State summaries may miss critical details if prompts are weak, causing wrong next actions.

Core Entities

Models

ChatGPT

Metrics

Pass Rate, Win Rate

Datasets

ToolBench (ToolLLM benchmark, 16k+ real APIs)

Benchmarks

ToolBench
