Overview
The approach is practical and tested on a cloned production API; results and a public dataset support its claims, but gains depend on API structure and model choice.
Citations: 2
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 70%
Why It Matters For Business
SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability. This is useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.
Summary TLDR
SoAy is a practical method that teaches LLMs to (1) generate a small API call plan (a "solution") and (2) produce executable code that implements that plan. The authors auto-build a dataset (SoAyBench, 3,960 triplets) by enumerating API dependency paths in AMiner, use it to align LLMs via both fine-tuning (SoAyLLaMA) and in-context prompting (SoAyGPT), and evaluate with a custom SoAyEval metric. On the cloned AMiner API testbed, SoAy variants cut inference time and substantially raise correct-answer rates versus prior tool-using baselines. The system is already deployed in production and has served tens of thousands of requests.
Problem Statement
Existing LLMs that call external academic APIs fail when API calls must be tightly coupled (outputs feeding later inputs) and suffer high latency when using step-by-step decision trees. Researchers need LLMs that both understand API coupling and answer queries quickly.
Main Contribution
SoAy: a two-step method where the LLM first outputs a compact API calling plan (solution) and then generates executable code guided by that plan.
SoAyBench: a publicly released dataset of 3,960 (Query, Solution, Code) triplets built from an AMiner API clone and a test set of 792 fixed questions.
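The two-step idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `search_person` and `get_person_papers`, the in-memory data, and the shape of the plan are hypothetical stand-ins, not the actual AMiner API or the authors' prompt format.

```python
# Hypothetical stand-ins for two coupled academic APIs (NOT the real AMiner API).
PEOPLE = {"Ada Example": "p1"}
PAPERS = {
    "p1": [
        {"title": "Paper A", "citations": 10},
        {"title": "Paper B", "citations": 42},
    ]
}


def search_person(name: str) -> str:
    """Return the id of the person with the given name."""
    return PEOPLE[name]


def get_person_papers(person_id: str) -> list:
    """Return the papers of a person; the input is the output of search_person."""
    return PAPERS[person_id]


# Step 1: the "solution" is a compact, ordered plan of API names.
solution = ["search_person", "get_person_papers"]


# Step 2: code written to follow that plan; the first call's output feeds the second.
def answer(name: str) -> str:
    person_id = search_person(name)
    papers = get_person_papers(person_id)
    return max(papers, key=lambda p: p["citations"])["title"]


print(answer("Ada Example"))  # Paper B
```

The plan is kept separate from the code so that the API sequence can be checked independently of whether the generated code is correct.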
Key Findings
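SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench (92.74% overall Score, Table 3).
SoAyGPT with GPT-4 outperforms baselines on correctness (86.57% overall Score versus 58.16% for GPT-DFSDT with GPT-4, Table 3).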
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Top automated score (SoAyLLaMA Code-13B) | 92.74% overall Score | ToolLLaMA / GPT-DFSDT | Not stated (see Table 3) | SoAyBench | Table 3 reports Code-13B Score 92.74% | Table 3 |
| SoAyGPT (GPT-4) automated score | 86.57% overall Score | GPT-DFSDT (GPT-4) Score 58.16% | +28.41 percentage points | SoAyBench | Table 3 rows for SoAyGPT and GPT-DFSDT with GPT-4 | Table 3 |
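
As a quick check of the delta reported for SoAyGPT with GPT-4, the arithmetic can be reproduced from the two Table 3 scores quoted above. The relative-improvement figure is derived here for context only and is not a number reported in the source.

```python
# Scores quoted from the results table (overall Score on SoAyBench, in percent).
soay_gpt4 = 86.57
dfsdt_gpt4 = 58.16

# Absolute difference in percentage points.
delta_pp = round(soay_gpt4 - dfsdt_gpt4, 2)

# Relative improvement over the baseline, in percent (derived, not from the source).
relative_pct = round(delta_pp / dfsdt_gpt4 * 100, 1)

print(delta_pp)      # 28.41
print(relative_pct)  # 48.8
```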
What To Try In 7 Days
Clone a small subset of your domain APIs and enumerate dependency paths to create simple solutions.
Generate a handful of (query, solution, code) examples and test in-context prompting with a closed LLM (SoAyGPT-style).
Fine-tune a small code-capable model (CodeLlama-7B) on the triplets to evaluate latency and accuracy trade-offs.
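The first step above, enumerating dependency paths, can be sketched as a depth-first walk over an API coupling graph. This is only an assumed formulation: the graph, the API names, the choice of entry points, and the `max_len` cutoff are all hypothetical, and the authors' actual enumeration procedure may differ.

```python
from typing import Dict, List

# Hypothetical coupling graph: an edge A -> B means B can consume A's output.
API_GRAPH: Dict[str, List[str]] = {
    "search_person": ["get_person_info", "get_person_papers"],
    "get_person_papers": ["get_paper_info"],
    "get_person_info": [],
    "get_paper_info": [],
}
# APIs that can be called directly from the user's query (assumed).
ENTRY_APIS = ["search_person"]


def enumerate_paths(graph: Dict[str, List[str]],
                    entries: List[str],
                    max_len: int = 3) -> List[List[str]]:
    """Return every simple call path of at most max_len APIs starting at an entry API."""
    paths: List[List[str]] = []

    def dfs(path: List[str]) -> None:
        paths.append(list(path))  # every prefix is itself a candidate solution
        if len(path) == max_len:
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no repeated API)
                path.append(nxt)
                dfs(path)
                path.pop()

    for entry in entries:
        dfs([entry])
    return paths


for p in enumerate_paths(API_GRAPH, ENTRY_APIS):
    print(" -> ".join(p))
```

Each enumerated path is a candidate solution that can then be paired with a query template and hand-written or generated code to form a triplet.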
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Method depends on having structured domain APIs and their coupling graph; no benefit when APIs are absent.
Performance gains are smaller with weaker instruction-following models (GPT-3.5 backbones showed less improvement).
When Not To Use
Your domain lacks stable, queryable APIs or snapshots to verify code execution.
You need open-ended or highly ambiguous answers that require model world knowledge rather than exact API facts.
Failure Modes
Wrong solution selection (model plans incorrect API sequence) leading to incorrect answers.
Correct solution but buggy generated code (WC): the code executes, yet it returns wrong outputs.
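
The two failure modes above can be told apart by checking the plan and the executed answer separately. The sketch below is an assumption for illustration: the label names and the exact-match comparison are hypothetical and are not the SoAyEval definition.

```python
from typing import Sequence


def classify(pred_solution: Sequence[str],
             gold_solution: Sequence[str],
             pred_answer: object,
             gold_answer: object) -> str:
    """Label one test case; labels are hypothetical, not SoAyEval's own taxonomy."""
    if list(pred_solution) != list(gold_solution):
        return "wrong_solution"  # the planned API sequence is incorrect
    if pred_answer != gold_answer:
        return "wrong_code"      # right plan, but the executed code gave a wrong output
    return "correct"


gold_plan = ["search_person", "get_person_papers"]
print(classify(["search_person"], gold_plan, "Paper A", "Paper B"))  # wrong_solution
print(classify(gold_plan, gold_plan, "Paper A", "Paper B"))          # wrong_code
print(classify(gold_plan, gold_plan, "Paper B", "Paper B"))          # correct
```

Separating the two checks shows whether errors come from planning or from code generation, which points to different fixes.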

