Teach LLMs to plan API call sequences, then generate executable code to answer academic queries faster and more reliably.

May 24, 2024 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

2

Authors

Yuanchun Wang, Jifan Yu, Zijun Yao, Jing Zhang, Yuyang Xie, Shangqing Tu, Yiyang Fu, Youhe Feng, Jinkai Zhang, Jingyao Zhang, Bowen Huang, Yuanyao Li, Huihui Yuan, Lei Hou, Juanzi Li, Jie Tang

Links

Abstract / PDF

Why It Matters For Business

SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability. This is useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.

Summary TLDR

SoAy is a practical method that teaches LLMs to (1) generate a small API call plan (a "solution") and (2) produce executable code that implements that plan. The authors auto-build a dataset (SoAyBench, 3,960 triplets) by enumerating API dependency paths in AMiner, use it to align LLMs both via fine-tuning (SoAyLLaMA) and in-context prompting (SoAyGPT), and evaluate with a custom SoAyEval metric. On the cloned AMiner API testbed, SoAy variants cut inference time and substantially raise correct-answer rates versus prior tool-using baselines. SoAy is already deployed in production and has served tens of thousands of requests.
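To make the plan-then-code idea concrete, here is a minimal sketch of what a solution and the code derived from it might look like. The API functions (`search_person`, `get_person_pubs`), their fields, and the in-memory data are hypothetical stand-ins, not the actual AMiner API. In SoAy the LLM would produce both the solution and the code; here they are written by hand for illustration.

```python
# Hypothetical stand-ins for two coupled academic APIs (not the real AMiner API).
_PEOPLE = {"Ada Example": "p1"}
_PUBS = {
    "p1": [
        {"title": "Paper A", "citations": 10},
        {"title": "Paper B", "citations": 42},
    ]
}


def search_person(name):
    """Return a person id for a name, or None if the name is unknown."""
    return _PEOPLE.get(name)


def get_person_pubs(person_id):
    """Return the publication records for a person id."""
    return _PUBS.get(person_id, [])


# Step 1: the "solution" is a short plan, an ordered chain of API names.
solution = ["search_person", "get_person_pubs"]


# Step 2: code that follows the plan. The output of the first call
# feeds the input of the second, which is the coupling the plan makes explicit.
def most_cited_title(name):
    person_id = search_person(name)
    if person_id is None:
        return None
    pubs = get_person_pubs(person_id)
    if not pubs:
        return None
    return max(pubs, key=lambda p: p["citations"])["title"]


answer = most_cited_title("Ada Example")
print(answer)
```

Because the whole chain runs in one execution, the model is consulted once for the plan and once for the code rather than once per API call.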

Problem Statement

Existing LLMs that call external academic APIs fail when API calls must be tightly coupled (outputs feeding later inputs) and suffer high latency when using step-by-step decision trees. Researchers need LLMs that both understand API coupling and answer queries quickly.

Main Contribution

SoAy: a two-step method where the LLM first outputs a compact API calling plan (solution) and then generates executable code guided by that plan.

SoAyBench: a publicly released dataset of 3,960 (Query, Solution, Code) triplets built from an AMiner API clone, with a fixed test set of 792 questions.

SoAyEval: a fine-grained evaluation protocol that checks solution correctness, code execution, and answer correctness (EM, DS, WS, WC, EE).

Two alignment recipes: SoAyGPT (prompted, for closed models) and SoAyLLaMA (supervised fine-tuning for open models).

Deployment and user study showing practical usability and lower latency compared to multi-step baselines.

Key Findings

SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench.

Numbers: Score 92.74% (Code-13B, Table 3)

SoAyGPT with GPT-4 outperforms baselines on correctness.

Numbers: Score 86.57% (SoAyGPT, GPT-4, Table 3)

SoAy reduces average response time versus multi-step decision-tree baselines.

Numbers: SoAyGPT 26.05s vs GPT-DFSDT 70.92s (GPT-4 backbones, Table 4)

Authors created a sizeable training/test resource automatically.

Numbers: 3,960 triplets total; test set = 792 samples (one-fifth)

In online human evaluation, SoAy answers were preferred over raw GPT-4 and human experts for exact queries.

Numbers: SoAy received the largest vote share across 52 live queries (Figure 5)

Results

Top automated score (SoAyLLaMA Code-13B)

Value: 92.74% overall Score

Baseline: ToolLLaMA / GPT-DFSDT

SoAyGPT (GPT-4) automated score

Value: 86.57% overall Score

Baseline: GPT-DFSDT (GPT-4) Score 58.16%

Average response time

Value: SoAyGPT (GPT-4) 26.05s; SoAyGPT (3.5) ~6.40s; SoAyLLaMA Code-7B 1.12s

Baseline: GPT-DFSDT (GPT-4) 70.92s; GPT-DFSDT (3.5-16k) 53.73s

Dataset size

Value: 3,960 triplets (44 combinations × 3 templates × 30 instantiations)

Deployment usage

Value: 54,800+ accesses
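The figures above can be cross-checked with a few lines of arithmetic. This sketch only recombines numbers quoted in this section; the variable names are illustrative, and the GPT-3.5 ratio compares the backbones as listed (3.5 versus 3.5-16k, with an approximate SoAyGPT time), so treat it as a rough indication.

```python
# Dataset size as quoted: combinations x templates x instantiations.
combinations, templates, instantiations = 44, 3, 30
total_triplets = combinations * templates * instantiations
test_set = total_triplets // 5  # one-fifth held out for testing

# Average response times in seconds: baseline divided by SoAyGPT.
speedup_gpt4 = 70.92 / 26.05
speedup_gpt35 = 53.73 / 6.40  # SoAyGPT (3.5) time is approximate in the source

print(total_triplets, test_set)
print(round(speedup_gpt4, 2), round(speedup_gpt35, 2))
```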

Who Should Care

What To Try In 7 Days

Clone a small subset of your domain APIs and enumerate dependency paths to create simple solutions.

Generate a handful of (query, solution, code) examples and test in-context prompting with a closed LLM (SoAyGPT-style).

Fine-tune a small code-capable model (CodeLlama-7B) on the triplets to evaluate latency and accuracy trade-offs.
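The first step above, enumerating dependency paths, could be sketched as a depth-limited walk over an API coupling graph. The graph, the API names, and the `max_hops` limit below are hypothetical and chosen only for illustration; a real graph would come from your own API specifications.

```python
# Hypothetical API coupling graph: an edge A -> B means an output of A
# can be used as an input of B. These names are illustrative only.
API_GRAPH = {
    "search_person": ["get_person_info", "get_person_pubs"],
    "get_person_pubs": ["get_publication"],
    "get_person_info": [],
    "get_publication": [],
}
ENTRY_POINTS = ["search_person"]  # APIs callable directly from a user query


def enumerate_paths(graph, entry_points, max_hops=3):
    """Return every call chain of 1..max_hops APIs starting at an entry point."""
    paths = []

    def extend(path):
        paths.append(list(path))
        if len(path) == max_hops:
            return
        for nxt in graph[path[-1]]:
            if nxt not in path:  # skip cycles
                extend(path + [nxt])

    for start in entry_points:
        extend([start])
    return paths


for path in enumerate_paths(API_GRAPH, ENTRY_POINTS):
    print(" -> ".join(path))
```

Each enumerated path is a candidate solution; pairing it with a few question templates and concrete entity values would yield (query, solution, code) triplets for prompting or fine-tuning.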

Agent Features

Memory

  • short-term context of solution and code (no long-term retrieval described)

Planning

  • solution generation (API call planning)

Tool Use

  • API-aware code generation
  • single-execution code-run for answers

Frameworks

  • SoAyGPT agent prompts
  • SoAyLLaMA fine-tuning pipeline

Is Agentic

true

Architectures

  • in-context multi-agent prompting (Solution/Code/Answer agents)
  • fine-tuned sequence-to-sequence LLMs (SoAyLLaMA)

Collaboration

  • modular agents: solution, code, answer

Optimization Features

Token Efficiency

  • plans reduce repeated reasoning calls, saving token usage

Training Optimization

  • SFT

Inference Optimization

  • reduce multi-step LLM calls by generating and executing code once

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method depends on having structured domain APIs and their coupling graph; no benefit when APIs are absent.
  • Performance gains are smaller with weaker instruction-following models (GPT-3.5 backbones showed less improvement).
  • SoAyBench initially biases prompts toward two-hop solutions, which can skew models.
  • Ambiguous, subjective questions are not improved by API-backed exact-answer generation.

When Not To Use

  • Your domain lacks stable, queryable APIs or snapshots to verify code execution.
  • You need open-ended or highly ambiguous answers that require model world knowledge rather than exact API facts.
  • You have strict low-latency constraints, code execution overhead exceeds the budget, and no lightweight model is available.

Failure Modes

  • Wrong solution selection (model plans incorrect API sequence) leading to incorrect answers.
  • Correct solution but buggy generated code (WC) causing wrong outputs despite executable code.
  • Execution errors (EE) due to non-executable code, network failures, or API changes.
  • Prompt bias from training data that favors certain hop counts or patterns.
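The failure modes above map onto per-sample outcome labels, which could be assigned roughly as follows. This is a sketch under assumptions: the summary only describes WC and EE, so the readings of EM (matching plan and correct answer), DS (a different plan that still gives the correct answer), and WS (wrong plan, wrong answer) are assumed here, and the function and argument names are hypothetical.

```python
def classify(pred_solution, gold_solution, executed_ok, pred_answer, gold_answer):
    """Assign one outcome label to a single test sample (assumed definitions)."""
    if not executed_ok:
        return "EE"  # generated code failed to run
    same_solution = pred_solution == gold_solution
    correct = pred_answer == gold_answer
    if same_solution and correct:
        return "EM"  # assumed: matching plan and matching answer
    if correct:
        return "DS"  # assumed: different plan, still the right answer
    if same_solution:
        return "WC"  # right plan, but the code produced a wrong answer
    return "WS"  # assumed: wrong plan and wrong answer


gold = ["search_person", "get_person_pubs"]
print(classify(gold, gold, True, "Paper B", "Paper B"))
print(classify(gold, gold, True, "Paper A", "Paper B"))
print(classify(["search_person"], gold, False, None, "Paper B"))
```

Separating wrong plans from wrong code shows whether errors come from planning or from code generation, which is the distinction the failure modes draw.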

Core Entities

Models

  • gpt-3.5-turbo-0613
  • gpt-3.5-turbo-16k-0613
  • gpt-4-0613
  • Llama-2-7b-chat-hf
  • CodeLlama-7b-Instruct-hf
  • CodeLlama-13b-Instruct-hf

Metrics

  • SoAyEval
  • EM
  • DS
  • WS
  • WC
  • EE
  • ACC
  • Score
  • Response time (s)

Datasets

  • SoAyBench (3,960 triplets)
  • AMiner cloned API subset (snapshot Sep 23, 2023)

Benchmarks

  • SoAyBench