Teach LLMs to plan API call sequences, then generate executable code to answer academic queries faster and more reliably.

May 24, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and tested on a cloned production API; results and a public dataset support its claims, but gains depend on API structure and model choice.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Yuanchun Wang, Jifan Yu, Zijun Yao, Jing Zhang, Yuyang Xie, Shangqing Tu, Yiyang Fu, Youhe Feng, Jinkai Zhang, Jingyao Zhang, Bowen Huang, Yuanyao Li, Huihui Yuan, Lei Hou, Juanzi Li, Jie Tang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability. This is useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.

Who Should Care

Summary TLDR

SoAy is a practical method that teaches LLMs to (1) generate a small API call plan (a "solution") and (2) produce executable code that implements that plan. The authors auto-build a dataset (SoAyBench, 3,960 triplets) by enumerating API dependency paths in AMiner, use it to align LLMs both via fine-tuning (SoAyLLaMA) and via in-context prompting (SoAyGPT), and evaluate with a custom SoAyEval metric. On the cloned AMiner API testbed, SoAy variants cut inference time and substantially raise correct-answer rates versus prior tool-using baselines. The method is already deployed in production and has served tens of thousands of requests.

Problem Statement

Existing LLMs that call external academic APIs fail when API calls must be tightly coupled (outputs feeding later inputs) and suffer high latency when using step-by-step decision trees. Researchers need LLMs that both understand API coupling and answer queries quickly.

Main Contribution

SoAy: a two-step method where the LLM first outputs a compact API calling plan (solution) and then generates executable code guided by that plan.

SoAyBench: a publicly released dataset of 3,960 (Query, Solution, Code) triplets built from an AMiner API clone and a test set of 792 fixed questions.
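To make the (Query, Solution, Code) shape concrete, here is a minimal sketch of one triplet and a single code execution. Everything in it is illustrative: the `Triplet` class, the `search_person` and `get_person_pubs` functions, and the in-memory data are hypothetical stand-ins, not the AMiner API or the SoAyBench schema.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for academic APIs (not the real AMiner endpoints).
_PEOPLE = {"Ada Example": "p1"}
_PUBS = {
    "p1": [
        {"title": "Paper A", "citations": 10},
        {"title": "Paper B", "citations": 42},
    ]
}

def search_person(name):
    """Resolve a name to an id; its output feeds get_person_pubs."""
    return {"person_id": _PEOPLE[name]}

def get_person_pubs(person_id):
    """Return the publications of one person id."""
    return _PUBS[person_id]

@dataclass
class Triplet:
    query: str      # natural-language question
    solution: list  # ordered API names: the compact plan
    code: str       # program that implements the plan

triplet = Triplet(
    query="What is the most cited paper by Ada Example?",
    solution=["search_person", "get_person_pubs"],
    code=(
        "person = search_person('Ada Example')\n"
        "pubs = get_person_pubs(person['person_id'])\n"
        "answer = max(pubs, key=lambda p: p['citations'])['title']\n"
    ),
)

def run(triplet, apis):
    """Execute the generated code once, exposing only the given API functions."""
    scope = dict(apis)
    exec(triplet.code, scope)  # assumption: code comes from a trusted or sandboxed source
    return scope["answer"]

APIS = {"search_person": search_person, "get_person_pubs": get_person_pubs}
answer = run(triplet, APIS)
print(answer)
```

In a real deployment the generated code would need sandboxing before being passed to `exec`; the sketch omits that for brevity.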

Key Findings

SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench.

Numbers: Score 92.74% (Code-13B, Table 3)

Practical Use: Fine-tuning an open-code model with solution+code examples can deliver >90% correct answers on the evaluated AMiner tasks; use this route when you can afford training.

Evidence Ref: Table 3

SoAyGPT with GPT-4 outperforms baselines on correctness.

Numbers: Score 86.57% (SoAyGPT, GPT-4, Table 3)

Practical Use: If you cannot fine-tune large closed models, in-context SoAy prompting with GPT-4 yields large accuracy gains versus prior tool-using prompts.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Top automated score (SoAyLLaMA Code-13B) | 92.74% overall Score | ToolLLaMA / GPT-DFSDT | ± (See Table 3) | SoAyBench | Table 3 reports Code-13B Score 92.74% | Table 3
SoAyGPT (GPT-4) automated score | 86.57% overall Score | GPT-DFSDT (GPT-4) Score 58.16% | +28.41 percentage points | SoAyBench | Table 3 rows for SoAyGPT and GPT-DFSDT with GPT-4 | Table 3

What To Try In 7 Days

Clone a small subset of your domain APIs and enumerate dependency paths to create simple solutions.

Generate a handful of (query, solution, code) examples and test in-context prompting with a closed LLM (SoAyGPT-style).

Fine-tune a small code-capable model (CodeLlama-7B) on the triplets to evaluate latency and accuracy trade-offs.
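The first step above, enumerating dependency paths, can be sketched as a walk over a small coupling graph. This is an illustrative sketch only: the graph, the API names, and the `enumerate_paths` helper are hypothetical, and the authors' actual enumeration procedure may differ.

```python
# Hypothetical coupling graph: an edge A -> B means an output of A
# can be passed as an input to B.
GRAPH = {
    "search_person": ["get_person_pubs", "get_coauthors"],
    "get_person_pubs": ["get_publication"],
    "get_coauthors": [],
    "get_publication": [],
}

# APIs that can be called directly from information in the user query.
ENTRY_POINTS = ["search_person"]

def enumerate_paths(graph, starts, max_len=3):
    """Return every simple call path of up to max_len APIs from the entry points."""
    paths = []

    def walk(path):
        paths.append(list(path))   # every prefix is itself a candidate solution
        if len(path) == max_len:
            return
        for nxt in graph[path[-1]]:
            if nxt not in path:    # avoid cycles
                walk(path + [nxt])

    for start in starts:
        walk([start])
    return paths

paths = enumerate_paths(GRAPH, ENTRY_POINTS)
for p in paths:
    print(" -> ".join(p))
```

Each enumerated path can then be paired with a templated question and a hand-checked program to form a small set of triplets for prompting or fine-tuning.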

Agent Features

Memory
short-term context of solution and code (no long-term retrieval described)
Planning
solution generation (API call planning)
Tool Use
API-aware code generation; single-execution code-run for answers
Frameworks
SoAyGPT agent prompts; SoAyLLaMA fine-tuning pipeline
Is Agentic

Yes

Architectures
in-context multi-agent prompting (Solution/Code/Answer agents); fine-tuned sequence-to-sequence LLMs (SoAyLLaMA)
Collaboration
modular agents: solution, code, answer

Optimization Features

Token Efficiency
plans reduce repeated reasoning calls, saving token usage
Training Optimization
SFT
Inference Optimization
reduce multi-step LLM calls by generating and executing code once
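A toy call-count model shows why generating and executing code once can help as API chains grow. The counts below are assumptions chosen for illustration, not measurements from the paper: the stepwise agent is assumed to need one LLM call per API step plus one for the answer, and the plan-then-code agent one call each for solution, code, and answer.

```python
def stepwise_llm_calls(num_api_steps):
    # Assumption: one LLM call to choose each API step, plus one to phrase the answer.
    # Backtracking in tree-search agents would add further calls.
    return num_api_steps + 1

def plan_then_code_llm_calls(num_api_steps):
    # Assumption: one call for the solution, one for the code, one for the answer,
    # independent of how many APIs the generated program chains together.
    return 3

rows = [(n, stepwise_llm_calls(n), plan_then_code_llm_calls(n)) for n in (1, 2, 4)]
for n, stepwise, planned in rows:
    print(f"api_steps={n} stepwise={stepwise} plan_then_code={planned}")
```

Under these assumptions the fixed-cost approach only pays off once a query needs more than two chained API calls, which matches the source's focus on tightly coupled, multi-API queries.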

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Method depends on having structured domain APIs and their coupling graph; no benefit when APIs are absent.

Performance gains are smaller with weaker instruction-following models (GPT-3.5 backbones showed less improvement).

When Not To Use

Your domain lacks stable, queryable APIs or snapshots to verify code execution.

You need open-ended or highly ambiguous answers that require model world knowledge rather than exact API facts.

Failure Modes

Wrong solution selection (model plans incorrect API sequence) leading to incorrect answers.

Correct solution but buggy generated code (WC): the code executes, yet it returns a wrong answer.
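The two failure modes can be told apart by checking the plan before the answer. The sketch below is a simplified triage under stated assumptions, not the SoAyEval definition: the `classify` function and its labels are hypothetical, and it assumes a gold solution and gold answer exist for each query.

```python
def classify(pred_solution, gold_solution, pred_answer, gold_answer):
    """Attribute an outcome to the plan or to the code (hypothetical labels)."""
    if pred_solution != gold_solution:
        return "wrong_solution"  # the planned API sequence differs from the gold plan
    if pred_answer != gold_answer:
        return "wrong_code"      # plan is right, but the executed code gave a wrong answer
    return "correct"

gold_plan = ["search_person", "get_person_pubs"]

print(classify(["search_person"], gold_plan, "Paper A", "Paper B"))
print(classify(gold_plan, gold_plan, "Paper A", "Paper B"))
print(classify(gold_plan, gold_plan, "Paper B", "Paper B"))
```

Separating the two cases is useful in practice: plan errors point to missing or unclear API descriptions, while code errors point to the code-generation examples.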

Core Entities

Models

gpt-3.5-turbo-0613, gpt-3.5-turbo-16k-0613, gpt-4-0613, Llama-2-7b-chat-hf, CodeLlama-7b-Instruct-hf, CodeLlama-13b-Instruct-hf

Metrics

SoAyEval, EM, DS, WS, WC, EE, ACC, Score, Response time (s)

Datasets

SoAyBench (3,960 triplets), AMiner cloned API subset (snapshot Sep 23, 2023)

Benchmarks

SoAyBench