Teach LLMs to plan API call sequences, then generate executable code to answer academic queries faster and more reliably.

Overview

Decision Snapshot: Needs Validation

The approach is practical and tested on a cloned production API; results and a public dataset support its claims, but gains depend on API structure and model choice.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Yuanchun Wang, Jifan Yu, Zijun Yao, Jing Zhang, Yuyang Xie, Shangqing Tu, Yiyang Fu, Youhe Feng, Jinkai Zhang, Jingyao Zhang, Bowen Huang, Yuanyao Li, Huihui Yuan, Lei Hou, Juanzi Li, Jie Tang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability. This is useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist, Engineering Lead

Summary TLDR

SoAy is a practical method that teaches LLMs to (1) generate a small API call plan (a "solution") and (2) produce executable code that implements that plan. The authors auto-build a dataset (SoAyBench, 3,960 triplets) by enumerating API dependency paths in AMiner, use it to align LLMs via both fine-tuning (SoAyLLaMA) and in-context prompting (SoAyGPT), and evaluate with a custom SoAyEval metric. On the cloned AMiner API testbed, SoAy variants cut inference time and substantially raise correct-answer rates versus prior tool-using baselines. The system is already deployed in production and has served tens of thousands of requests.

Problem Statement

Existing LLMs that call external academic APIs fail when API calls must be tightly coupled (outputs feeding later inputs) and suffer high latency when using step-by-step decision trees. Researchers need LLMs that both understand API coupling and answer queries quickly.

Main Contribution

SoAy: a two-step method where the LLM first outputs a compact API calling plan (solution) and then generates executable code guided by that plan.

SoAyBench: a publicly released dataset of 3,960 (Query, Solution, Code) triplets built from an AMiner API clone and a test set of 792 fixed questions.
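
The two-step idea (plan first, then code) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the API functions `search_person` and `get_person_pubs`, the toy data, and the `plan_to_code` helper are all hypothetical stand-ins, and the planning step that an LLM would perform is replaced by a hard-coded plan.

```python
# Toy stand-ins for academic APIs (hypothetical names and data, not AMiner's).
PEOPLE = {"Ada Lovelace": "p1"}
PUBS = {"p1": [{"title": "Notes on the Analytical Engine", "year": 1843}]}


def search_person(name):
    """Return a person id for a name (toy lookup)."""
    return PEOPLE[name]


def get_person_pubs(person_id):
    """Return the publication records for a person id (toy lookup)."""
    return PUBS[person_id]


API_REGISTRY = {
    "search_person": search_person,
    "get_person_pubs": get_person_pubs,
}


def plan_to_code(plan, arg_name="query_arg"):
    """Stand-in for step 2: turn a plan (ordered API names) into Python
    source in which each call consumes the previous call's output."""
    lines = [f"value = {arg_name}"]
    for api in plan:
        lines.append(f"value = {api}(value)")
    lines.append("result = value")
    return "\n".join(lines)


# Stand-in for step 1: an LLM would emit this plan; here it is hard-coded.
plan = ["search_person", "get_person_pubs"]
code = plan_to_code(plan)

# Execute the generated code once, exposing only the registered APIs.
scope = dict(API_REGISTRY, query_arg="Ada Lovelace")
exec(code, scope)
titles = [pub["title"] for pub in scope["result"]]
print(titles)
```

The point of the sketch is the shape of the pipeline: the plan fixes which APIs are chained and in what order, so the code only has to wire outputs to inputs and run once.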

Key Findings

SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench.

Numbers: Score 92.74% (Code-13B, Table 3)

Practical Use: Fine-tuning an open-code model with solution+code examples can deliver >90% correct answers on the evaluated AMiner tasks; use this route when you can afford training.

Evidence Ref: Table 3

SoAyGPT with GPT-4 outperforms baselines on correctness.

Numbers: Score 86.57% (SoAyGPT, GPT-4, Table 3)

Practical Use: If you cannot fine-tune large closed models, in-context SoAy prompting with GPT-4 yields large accuracy gains versus prior tool-using prompts.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Top automated score (SoAyLLaMA Code-13B)	92.74% overall Score	ToolLLaMA / GPT-DFSDT	Not stated (see Table 3)	SoAyBench	Table 3 reports Code-13B Score 92.74%	Table 3
SoAyGPT (GPT-4) automated score	86.57% overall Score	GPT-DFSDT (GPT-4) Score 58.16%	+28.41 percentage points	SoAyBench	Table 3 rows for SoAyGPT and GPT-DFSDT with GPT-4	Table 3

What To Try In 7 Days

Clone a small subset of your domain APIs and enumerate dependency paths to create simple solutions.

Generate a handful of (query, solution, code) examples and test in-context prompting with a closed LLM (SoAyGPT-style).

Fine-tune a small code-capable model (CodeLlama-7B) on the triplets to evaluate latency and accuracy trade-offs.
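
The first step above, enumerating dependency paths, can be sketched as a small graph walk. This is a minimal sketch under assumptions: the API names and their input/output types below are invented for illustration and are not the AMiner API, and an edge is assumed to exist whenever one API's output type matches another API's input type.

```python
# Hypothetical API signatures: name -> (input type, output type).
APIS = {
    "search_person": ("person_name", "person_id"),
    "get_person_pubs": ("person_id", "pub_id"),
    "get_person_interest": ("person_id", "interest"),
    "get_pub_info": ("pub_id", "pub_info"),
}


def enumerate_paths(apis, start_type, max_len=3):
    """Depth-first enumeration of API chains in which each call's output
    type matches the next call's input type. Every prefix is also kept,
    since a shorter chain can answer a simpler query."""
    paths = []

    def walk(current_type, path):
        for name, (in_type, out_type) in apis.items():
            if in_type != current_type or name in path:
                continue
            new_path = path + [name]
            paths.append(new_path)
            if len(new_path) < max_len:
                walk(out_type, new_path)

    walk(start_type, [])
    return paths


solutions = enumerate_paths(APIS, "person_name")
for solution in solutions:
    print(" -> ".join(solution))
```

Each enumerated chain is a candidate solution; pairing it with a query template and the code that executes it would yield one (query, solution, code) example.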

Agent Features

Memory

short-term context of solution and code (no long-term retrieval described)

Planning

solution generation (API call planning)

Tool Use

API-aware code generation

single-execution code-run for answers

Frameworks

SoAyGPT agent prompts

SoAyLLaMA fine-tuning pipeline

Is Agentic

Yes

Architectures

in-context multi-agent prompting (Solution/Code/Answer agents)

fine-tuned sequence-to-sequence LLMs (SoAyLLaMA)

Collaboration

modular agents: solution, code, answer

Optimization Features

Token Efficiency

plans reduce repeated reasoning calls, saving token usage

Training Optimization

SFT

Inference Optimization

reduce multi-step LLM calls by generating and executing code once

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/RUCKBReasoning/SoAy

Data URLs

https://github.com/RUCKBReasoning/SoAy (SoAyBench and API clone info)

Risks & Boundaries

Limitations

Method depends on having structured domain APIs and their coupling graph; no benefit when APIs are absent.

Performance gains are smaller with weaker instruction-following models (GPT-3.5 backbones showed less improvement).

When Not To Use

Your domain lacks stable, queryable APIs or snapshots to verify code execution.

You need open-ended or highly ambiguous answers that require model world knowledge rather than exact API facts.

Failure Modes

Wrong solution selection (model plans incorrect API sequence) leading to incorrect answers.

Correct solution but buggy generated code (WC) causing wrong outputs despite executable code.
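
When triaging failures on your own API set, the two modes above can be separated with a simple check. This is a simplified sketch, not the SoAyEval metric: the `classify` function and its category names are hypothetical, and it treats any plan mismatch as a wrong solution, which may be stricter than the paper's scoring.

```python
def classify(pred_solution, gold_solution, pred_answer, gold_answer, executed=True):
    """Assign one illustrative outcome label to a single evaluated query."""
    if not executed:
        # The generated code raised an error or produced no result.
        return "execution_error"
    if pred_solution != gold_solution:
        # The model planned a different API sequence than the reference.
        return "wrong_solution"
    if pred_answer != gold_answer:
        # The plan matched, but the code that ran produced a wrong output.
        return "wrong_code"
    return "correct"


gold_plan = ["search_person", "get_person_pubs"]
records = [
    (gold_plan, gold_plan, 12, 12, True),
    (["search_person"], gold_plan, 3, 12, True),
    (gold_plan, gold_plan, 11, 12, True),
    (gold_plan, gold_plan, None, 12, False),
]
labels = [classify(*record) for record in records]
print(labels)
```

Counting these labels over a test set shows whether effort should go into better planning examples or into better code examples.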

Core Entities

Models

gpt-3.5-turbo-0613

gpt-3.5-turbo-16k-0613

gpt-4-0613

Llama-2-7b-chat-hf

CodeLlama-7b-Instruct-hf

CodeLlama-13b-Instruct-hf

Metrics

SoAyEval

EM

DS

WS

WC

EE

ACC

Score

Response time (s)

Datasets

SoAyBench (3,960 triplets)

AMiner cloned API subset (snapshot Sep 23, 2023)

Benchmarks

SoAyBench
