Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability. This is useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.
Summary TLDR
SoAy is a practical method that teaches LLMs to (1) generate a small API call plan (a "solution") and (2) produce executable code that implements that plan. The authors auto-build a dataset (SoAyBench, 3,960 triplets) by enumerating API dependency paths in AMiner, use it to align LLMs both via fine-tuning (SoAyLLaMA) and via in-context prompting (SoAyGPT), and evaluate with a custom SoAyEval metric. On the cloned AMiner API testbed, SoAy variants cut inference time and substantially raise correct-answer rates versus prior tool-using baselines. The system is already deployed in production and has served tens of thousands of requests.
Problem Statement
Existing LLMs that call external academic APIs fail when API calls must be tightly coupled (outputs feeding later inputs) and suffer high latency when using step-by-step decision trees. Researchers need LLMs that both understand API coupling and answer queries quickly.
Main Contribution
SoAy: a two-step method where the LLM first outputs a compact API calling plan (solution) and then generates executable code guided by that plan.
SoAyBench: a publicly released dataset of 3,960 (Query, Solution, Code) triplets built from an AMiner API clone and a test set of 792 fixed questions.
SoAyEval: a fine-grained evaluation protocol that checks solution correctness, code execution, and answer correctness (EM, DS, WS, WC, EE).
Two alignment recipes: SoAyGPT (prompted, for closed models) and SoAyLLaMA (supervised fine-tuning for open models).
Deployment and user study showing practical usability and lower latency compared to multi-step baselines.
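The two-step method in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the API functions, their data, and the hard-coded plan are hypothetical stand-ins for a cloned academic API and for the two LLM calls.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for a cloned academic API (not AMiner's real interface).
def search_person(name: str) -> str:
    return {"Ada Lovelace": "p1"}.get(name, "")

def get_person_pubs(person_id: str) -> List[str]:
    return {"p1": ["Notes on the Analytical Engine"]}.get(person_id, [])

APIS: Dict[str, Callable] = {
    "search_person": search_person,
    "get_person_pubs": get_person_pubs,
}

def plan_solution(query: str) -> List[str]:
    """Step 1: an LLM would emit the API call plan; hard-coded here (assumption)."""
    return ["search_person", "get_person_pubs"]

def generate_code(entity: str, solution: List[str]) -> str:
    """Step 2: an LLM would write code guided by the plan; templated here.

    Each call consumes the previous call's output, which is the coupling
    between APIs that the plan makes explicit.
    """
    lines = [f"value = {entity!r}"]
    for api_name in solution:
        lines.append(f"value = {api_name}(value)")
    lines.append("answer = value")
    return "\n".join(lines)

def answer_query(query: str, entity: str) -> List[str]:
    solution = plan_solution(query)
    code = generate_code(entity, solution)
    scope = dict(APIS)
    exec(code, scope)  # executed once; a real system should sandbox this
    return scope["answer"]

print(answer_query("What has Ada Lovelace published?", "Ada Lovelace"))
```

The plan is produced once and the code runs once, so the model is not consulted again between API calls.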
Key Findings
SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench.
SoAyGPT with GPT-4 outperforms baselines on correctness.
SoAy reduces average response time versus multi-step decision-tree baselines.
The authors automatically built a sizeable training and test resource: SoAyBench (3,960 triplets) plus a fixed 792-question test set.
In online human evaluation, SoAy answers were preferred over those from raw GPT-4 and from human experts on exact-answer queries.
Results
Top automated score (SoAyLLaMA Code-13B)
SoAyGPT (GPT-4) automated score
Average response time
Dataset size
Deployment usage
Who Should Care
What To Try In 7 Days
Clone a small subset of your domain APIs and enumerate dependency paths to create simple solutions.
Generate a handful of (query, solution, code) examples and test in-context prompting with a closed LLM (SoAyGPT-style).
Fine-tune a small code-capable model (CodeLlama-7B) on the triplets to evaluate latency and accuracy trade-offs.
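Enumerating dependency paths, as suggested in the first step above, might look like the sketch below. It assumes each API can be described by one input type and one output type, and that one API can follow another when those types match. The API names and types are hypothetical, and the summary does not say the authors' construction works exactly this way.

```python
from typing import Dict, List, Tuple

# Hypothetical API signatures: name -> (input type, output type).
API_TYPES: Dict[str, Tuple[str, str]] = {
    "search_person": ("name", "person_id"),
    "get_person_pubs": ("person_id", "pub_id"),
    "get_pub_info": ("pub_id", "pub_detail"),
    "get_person_info": ("person_id", "person_detail"),
}

def enumerate_solutions(
    apis: Dict[str, Tuple[str, str]], start_type: str, max_hops: int
) -> List[List[str]]:
    """Depth-first enumeration of every call chain of 1..max_hops APIs."""
    paths: List[List[str]] = []

    def extend(path: List[str], current_type: str) -> None:
        for name, (in_type, out_type) in apis.items():
            # An API can be appended only if it accepts the current output type.
            if in_type != current_type or name in path:
                continue
            new_path = path + [name]
            paths.append(new_path)
            if len(new_path) < max_hops:
                extend(new_path, out_type)

    extend([], start_type)
    return paths

for solution in enumerate_solutions(API_TYPES, "name", max_hops=3):
    print(" -> ".join(solution))
```

Each enumerated path is a candidate solution. Pairing it with a templated question and reference code would give one (query, solution, code) triplet.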
Agent Features
Memory
- short-term context of solution and code (no long-term retrieval described)
Planning
- solution generation (API call planning)
Tool Use
- API-aware code generation
- single code execution per query to produce the answer
Frameworks
- SoAyGPT agent prompts
- SoAyLLaMA fine-tuning pipeline
Is Agentic
true
Architectures
- in-context multi-agent prompting (Solution/Code/Answer agents)
- fine-tuned sequence-to-sequence LLMs (SoAyLLaMA)
Collaboration
- modular agents: solution, code, answer
Optimization Features
Token Efficiency
- plans reduce repeated reasoning calls, which lowers token usage
Training Optimization
- supervised fine-tuning (SFT)
Inference Optimization
- replaces multi-step LLM calls with code that is generated and executed once
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Method depends on having structured domain APIs and their coupling graph; no benefit when APIs are absent.
- Performance gains are smaller with weaker instruction-following models (GPT-3.5 backbones showed less improvement).
- SoAyBench initially biases prompts toward two-hop solutions, which can skew models.
- Ambiguous, subjective questions are not improved by API-backed exact-answer generation.
When Not To Use
- Your domain lacks stable, queryable APIs or snapshots to verify code execution.
- You need open-ended or highly ambiguous answers that require model world knowledge rather than exact API facts.
- You have strict low-latency constraints, code execution overhead exceeds the budget, and no lightweight model is available.
Failure Modes
- Wrong solution selection (model plans incorrect API sequence) leading to incorrect answers.
- Correct solution but buggy generated code (WC): the code executes but returns wrong outputs.
- Execution errors (EE) due to non-executable code, network failures, or API changes.
- Prompt bias from training data that favors certain hop counts or patterns.
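The failure modes above suggest a simple way to bucket each test result. The sketch below is one plausible reading, not the SoAyEval definition: WS, WC, and EE follow the descriptions in this list, while treating EM as "same solution, correct answer" and DS as "different solution, correct answer" is an assumption this summary does not confirm. The function name and arguments are hypothetical.

```python
from typing import List

def classify_outcome(
    pred_solution: List[str],
    gold_solution: List[str],
    pred_answer: object,
    gold_answer: object,
    executed: bool,
) -> str:
    """Assign one result to EM, DS, WS, WC, or EE (labels partly assumed)."""
    if not executed:
        return "EE"  # code failed to run (non-executable, network, API change)
    same_solution = pred_solution == gold_solution
    correct = pred_answer == gold_answer
    if same_solution and correct:
        return "EM"  # assumption: planned path and answer both match
    if same_solution:
        return "WC"  # right plan, but the generated code gave a wrong output
    if correct:
        return "DS"  # assumption: a different plan that still answers correctly
    return "WS"      # wrong plan leading to a wrong answer

gold = ["search_person", "get_person_pubs"]
print(classify_outcome(gold, gold, ["paper A"], ["paper A"], executed=True))
print(classify_outcome(["search_person"], gold, [], ["paper A"], executed=True))
```

Separating plan errors from code errors this way shows whether a model needs better planning examples or better code generation.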
Core Entities
Models
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-16k-0613
- gpt-4-0613
- Llama-2-7b-chat-hf
- CodeLlama-7b-Instruct-hf
- CodeLlama-13b-Instruct-hf
Metrics
- SoAyEval
- EM
- DS
- WS
- WC
- EE
- ACC
- Score
- Response time (s)
Datasets
- SoAyBench (3,960 triplets)
- AMiner cloned API subset (snapshot Sep 23, 2023)
Benchmarks
- SoAyBench

