Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

February 28, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

3

Authors

Yulong Liu, Yunlong Yuan, Chunwei Wang, Jianhua Han, Yongqiang Ma, Li Zhang, Nanning Zheng, Hang Xu

Links

Abstract / PDF

Why It Matters For Business

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router plus a summarizing state manager can raise task success rates and cut repeated failures with little engineering overhead.

Summary TLDR

Sum2Act is a prompt-driven pipeline that makes an LLM (tested with ChatGPT) pick and call open-world APIs repeatedly while keeping a short, high-density task State (summary + failure history). A Router proposes actions (which API to call or 'Finish') and a State Manager summarizes each API response, records successes/failures, and guides the next step. On the ToolBench benchmark (16k+ real APIs) Sum2Act improves pass rate over ReAct and DFSDT and extends naturally to vision APIs.

Problem Statement

Calling many real-world APIs reliably is hard for LLMs because long raw logs overload context, failed API calls cause error propagation, and tree-search methods can miss useful info from other branches. The paper aims to give LLMs a compact, evolving task state and a two-module pipeline so they can plan, avoid repeating failures, and handle dynamic API responses.

Main Contribution

A two-part pipeline (Router + State Manager) that forces the LLM to summarize results after every API call and to keep a short, dense State with current results and failure history.

An action-proposal loop where the Router picks an API or 'Finish' and the State Manager validates outcomes and records failures so future choices avoid bad tools.

Empirical evaluation on ToolBench (16k+ real APIs) showing improved Pass Rate and Win Rate versus ReAct and DFSDT, plus demonstration of integrating visual APIs (SDXL, ControlNet, BLIP, InstructPix2Pix).
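The Router and State Manager loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `State` and `Action` classes, `run_pipeline`, the toy router and state manager, and the two weather functions are all hypothetical names, and the toy functions stand in for what would be LLM prompts and real API calls.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class State:
    """Compact task state: running summary plus failure history."""
    summary: str = ""
    failures: Dict[str, str] = field(default_factory=dict)  # tool name -> reason


@dataclass
class Action:
    """Router output: a tool name with arguments, or the name 'Finish'."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def run_pipeline(task, router, state_manager, apis, max_steps=8):
    """Alternate Router and State Manager until 'Finish' or the step budget."""
    state = State()
    for _ in range(max_steps):
        action = router(task, state, apis)
        if action.name == "Finish":
            break
        try:
            observation, ok = apis[action.name](**action.args), True
        except Exception as exc:  # a failed call becomes an observation
            observation, ok = str(exc), False
        state = state_manager(state, action, observation, ok)
    return state


# Toy stand-ins (hypothetical): a real system would prompt an LLM in both roles.
def toy_router(task, state, apis):
    if state.summary:  # something useful is already summarized
        return Action("Finish")
    untried = [name for name in apis if name not in state.failures]
    if not untried:  # every tool has failed, give up
        return Action("Finish")
    return Action(untried[0], {"city": "Paris"})


def toy_state_manager(state, action, observation, ok):
    failures = dict(state.failures)
    if ok:
        return State(f"{action.name}: {observation}", failures)
    failures[action.name] = observation  # remember why the tool failed
    return State(state.summary, failures)


def flaky_weather(city):
    raise RuntimeError("timeout")


def stable_weather(city):
    return f"Sunny in {city}"


apis = {"flaky_weather": flaky_weather, "stable_weather": stable_weather}
final = run_pipeline("Weather in Paris?", toy_router, toy_state_manager, apis)
print(final.summary)
print(final.failures)
```

In this toy run the first tool fails, the failure is recorded, and the router moves on to the second tool instead of retrying the first. The step budget (`max_steps`) is an assumption added to keep the loop bounded.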

Key Findings

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%

Sum2Act wins in pairwise comparisons more often than baselines

Numbers: Win Rate avg: Sum2Act vs ReAct 67.8%; Sum2Act vs DFSDT 54.6%

Adding a task-decomposition step gives only small improvements

Numbers: Pass Rate avg: 70.0% → 70.7%; Win Rate avg: 67.8% → 68.8%

Sum2Act handles vision tools by integrating visual APIs

Numbers: Visual APIs used: SDXL, ControlNet, BLIP, InstructPix2Pix (demonstrated cases)

Results

Pass Rate (average across test splits)

Value: 70.0%

Baseline: DFSDT 67.0%; ReAct 41.1%

Win Rate (pairwise)

Value: 67.8% vs ReAct; 54.6% vs DFSDT

Baseline: ReAct; DFSDT

Effect of Task Decomposition

Value: Pass Rate 70.0% → 70.7%; Win Rate 67.8% → 68.8%

Baseline: Sum2Act without decomposition
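To put the reported Pass Rate figures in perspective, the short calculation below turns them into absolute gains (percentage points) and relative gains (percent of the baseline). The input numbers are the ones listed above; the helper name `gain` is just for illustration.

```python
# Average Pass Rates as reported above (percent).
pass_rate = {"Sum2Act": 70.0, "DFSDT": 67.0, "ReAct": 41.1}


def gain(new, old):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return round(new - old, 1), round(100.0 * (new - old) / old, 1)


# Sum2Act compared with each baseline.
gains = {
    name: gain(pass_rate["Sum2Act"], base)
    for name, base in pass_rate.items()
    if name != "Sum2Act"
}

# Effect of adding task decomposition on Pass Rate (70.0% -> 70.7%).
decomposition_gain = gain(70.7, 70.0)

print(gains)               # {'DFSDT': (3.0, 4.5), 'ReAct': (28.9, 70.3)}
print(decomposition_gain)  # (0.7, 1.0)
```

Read this way, the gap to ReAct is large (about 29 points), the gap to DFSDT is modest (3 points), and task decomposition adds under one point.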

Who Should Care

What To Try In 7 Days

Wrap your LLM calls in a loop: Router suggests an API or 'Finish', then call the API and record the raw result.

After each call, call the LLM to produce a short State: current results + failure reason when relevant; keep the state short (high information density).

Run end-to-end tests with an oracle or curated set of correct APIs to measure pass/win rates before integrating a retriever.
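For the measurement step, a small harness along these lines may be enough to start with. It is a sketch under simplifying assumptions: each task yields a single pass/fail flag, each pairwise comparison yields a single win/loss flag with no ties, and the function names and sample data are hypothetical rather than taken from ToolBench's own evaluator.

```python
from typing import Sequence


def pass_rate(passed: Sequence[bool]) -> float:
    """Percentage of tasks that were completed successfully."""
    if not passed:
        raise ValueError("need at least one task result")
    return 100.0 * sum(passed) / len(passed)


def win_rate(wins: Sequence[bool]) -> float:
    """Percentage of pairwise comparisons that our system won (no ties)."""
    if not wins:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(wins) / len(wins)


def average_over_splits(rates: Sequence[float]) -> float:
    """Unweighted mean of per-split rates."""
    if not rates:
        raise ValueError("need at least one split")
    return sum(rates) / len(rates)


# Made-up outcomes for two tiny test splits.
split_a = [True, True, False, True]   # 3 of 4 tasks passed
split_b = [True, False]               # 1 of 2 tasks passed
per_split = [pass_rate(split_a), pass_rate(split_b)]

print(per_split)                                       # [75.0, 50.0]
print(average_over_splits(per_split))                  # 62.5
print(win_rate([True, False, True, True, False]))      # 60.0
```

Averaging per split rather than pooling all tasks keeps small splits from being drowned out by large ones; whether that matches your reporting needs is a design choice to make explicitly.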

Agent Features

Memory

  • State summarization (short, high-density summary of observations)
  • Failure history to avoid repeating failed tools
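One way to expose these two memory features to the Router is to render them as a short text block that goes into its prompt. The sketch below assumes the failure history is a mapping from tool name to failure reason; `render_state`, the cap on listed failures, and the sample tool names are hypothetical.

```python
from typing import Dict


def render_state(summary: str, failures: Dict[str, str], max_failures: int = 3) -> str:
    """Format the summary and the most recent failures as prompt text."""
    lines = [f"Summary: {summary or '(nothing yet)'}"]
    # Keep only the latest entries so the block stays short.
    recent = list(failures.items())[-max_failures:] if max_failures > 0 else []
    for tool, reason in recent:
        lines.append(f"Failed: {tool} ({reason})")
    return "\n".join(lines)


failures = {"geo_lookup": "HTTP 500", "fx_rates": "missing API key"}
prompt_block = render_state("Found 3 hotels in Lyon.", failures)
print(prompt_block)
```

Capping the number of listed failures is an added assumption to keep the state short; the right cap depends on how many tools a task typically touches.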

Planning

  • Router action proposal (choose tool or Finish)
  • Iterative plan refinement using summarized State

Tool Use

  • Open-world API invocation (16k+ APIs)
  • Visual API integration for image tasks

Frameworks

  • Sum2Act

Is Agentic

true

Architectures

  • Router + State Manager pipeline

Optimization Features

Token Efficiency

  • State summarization reduces context length vs raw memory
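The effect on context length can be illustrated with a toy comparison: a raw log grows with every API response, while a rewritten summary stays roughly constant. The numbers below come from synthetic strings and a whitespace word count, which is only a crude proxy for real tokenizer counts; the fixed-template summary stands in for what would be an LLM-written one.

```python
def approx_tokens(text: str) -> int:
    """Crude proxy for token count: number of whitespace-separated words."""
    return len(text.split())


raw_log = []   # raw memory: every response is appended
sizes = []     # (raw log size, summarized state size) after each step

for step in range(1, 6):
    # Synthetic API response: 3 header words plus 50 payload words.
    response = f"step {step} response " + "payload " * 50
    raw_log.append(response)
    # Placeholder summary, rewritten (not appended) at each step.
    state_summary = f"Completed {step} steps; latest result kept in brief."
    sizes.append((approx_tokens(" ".join(raw_log)), approx_tokens(state_summary)))

for step, (raw, summarized) in enumerate(sizes, start=1):
    print(f"step {step}: raw={raw} summarized={summarized}")
```

In this synthetic setup the raw log grows by 53 words per step while the summary stays at 8, which is the shape of the saving, not a measurement of it.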

Reproducibility

Data Urls

  • ToolBench (ToolLLM benchmark; 16,000+ real APIs from RapidAPI Hub) cited in paper

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments use the oracle API retriever, so practical gains depend on retriever quality in real deployments.
  • Results reported with a single LLM (ChatGPT); behavior may differ with other models.
  • APIs are dynamic, so reruns can change results; the paper re-ran the baselines because of that variability.

When Not To Use

  • When you cannot access a reliable API retriever or ground-truth tool index.
  • When strict low-latency constraints rule out multiple LLM calls per step.
  • When API responses are uniformly noisy and summarization cannot extract task-relevant signals.

Failure Modes

  • If the retriever provides wrong tools, the Router will follow bad paths even with summarization.
  • State summaries may miss critical details if prompts are weak, causing wrong next actions.
  • Dynamic API outputs can cause non-deterministic behavior and require re-evaluation of Win/Pass over time.
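Because dynamic APIs make a single evaluation run unreliable, it may help to repeat the evaluation and report the spread alongside the mean. The sketch below uses Python's standard `statistics` module; `summarize_runs` is a hypothetical helper and the three pass rates are made-up values, not results from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(rates: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of repeated pass rates."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return {
        "mean": round(statistics.mean(rates), 2),
        "stdev": round(statistics.stdev(rates), 2),
        "spread": round(max(rates) - min(rates), 2),
    }


# Made-up pass rates (percent) from three reruns of the same test set.
runs = [70.0, 68.5, 71.5]
report = summarize_runs(runs)
print(report)  # {'mean': 70.0, 'stdev': 1.5, 'spread': 3.0}
```

If the spread across reruns is comparable to the gap between two methods, the comparison probably needs more runs before drawing conclusions.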

Core Entities

Models

  • ChatGPT

Metrics

  • Pass Rate
  • Win Rate

Datasets

  • ToolBench (ToolLLM benchmark, 16k+ real APIs)

Benchmarks

  • ToolBench