Overview
The paper provides an open dataset and clear AST-based checks with multi-model experiments, making findings actionable for engineering teams.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 50%
Why It Matters For Business
LLM snippets can run but still be unsafe: unchecked API misuse can cause crashes, leaks, or data loss if pushed to production, so teams must verify LLM code before deployment.
Summary TLDR
The authors release ROBUSTAPI, a benchmark of 1,208 Stack Overflow Java questions covering 18 APIs and 41 documented API-usage rules. They build an AST-based checker that detects API misuse (missing guard checks, wrong call order, missing close, etc.). Evaluating GPT-3.5, GPT-4, Llama-2, Vicuna-1.5, and code-specialized models shows widespread misuse: 57–70% of compilable answers contain a misuse, and GPT-4's overall misuse rate is about 62% in the zero-shot setting. A one-shot prompt with a correct usage example reduces misuse for some models. The dataset and checker are open-sourced.
Problem Statement
Existing code benchmarks focus on functional correctness (whether the code runs and produces the expected output) rather than long-term reliability. LLMs can output executable code that still misuses APIs (e.g., missing exception handling, not closing resources), which can cause crashes, leaks, or data loss in production. The paper fills this evaluation gap with a dataset and a static checker.
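To make the "runs but is unreliable" point concrete, here is a minimal Java sketch of one misuse pattern named above: a resource that is not closed when an exception interrupts the normal path. `TrackedResource`, `leakyRead`, and `safeRead` are hypothetical names invented for this illustration; they are not taken from the paper or from any ROBUSTAPI rule.

```java
class ResourceCloseSketch {
    /** Hypothetical stand-in for a file or stream; it records whether close() ran. */
    static final class TrackedResource implements AutoCloseable {
        private boolean closed = false;

        String read(boolean fail) {
            if (fail) {
                throw new IllegalStateException("simulated read failure");
            }
            return "data";
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Misuse: close() sits on the happy path only, so a failing read() skips it. */
    static boolean leakyRead(TrackedResource resource, boolean fail) {
        try {
            resource.read(fail);
            resource.close();
        } catch (IllegalStateException e) {
            // The failure is handled, but the resource stays open.
        }
        return resource.isClosed();
    }

    /** Safer: try-with-resources calls close() on both the normal and the failing path. */
    static boolean safeRead(TrackedResource resource, boolean fail) {
        try (TrackedResource r = resource) {
            r.read(fail);
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes.
        }
        return resource.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(leakyRead(new TrackedResource(), true)); // false
        System.out.println(safeRead(new TrackedResource(), true));  // true
    }
}
```

Both methods compile and behave identically when nothing fails, which is why a purely functional test would likely not tell them apart.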
Main Contribution
ROBUSTAPI dataset: 1,208 real Stack Overflow Java questions covering 18 representative APIs and 41 API-usage rules.
An AST-based API-usage checker that detects misuses by comparing extracted call/control sequences to documented rules.
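The idea of comparing a call sequence against usage rules can be sketched roughly as follows. This is not the authors' checker: it assumes the call sequence has already been extracted from the AST into a flat list of strings, it ignores control flow, and the `Rule` and `Kind` types, the rule format, and the call labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ApiRuleCheckSketch {
    /** Hypothetical rule kinds: a companion call must appear later, or earlier. */
    enum Kind { REQUIRES_LATER, REQUIRES_EARLIER }

    /** Hypothetical rule: `call` is only valid if `companion` appears on the given side. */
    static final class Rule {
        final String call;
        final Kind kind;
        final String companion;

        Rule(String call, Kind kind, String companion) {
            this.call = call;
            this.kind = kind;
            this.companion = companion;
        }
    }

    /**
     * Returns one message per violated occurrence of a rule's call.
     * Coarse by design: any earlier or later companion call satisfies the rule,
     * regardless of branches or loops.
     */
    static List<String> check(List<String> calls, List<Rule> rules) {
        List<String> violations = new ArrayList<>();
        for (Rule rule : rules) {
            for (int i = 0; i < calls.size(); i++) {
                if (!calls.get(i).equals(rule.call)) {
                    continue;
                }
                List<String> window = rule.kind == Kind.REQUIRES_LATER
                        ? calls.subList(i + 1, calls.size())
                        : calls.subList(0, i);
                if (!window.contains(rule.companion)) {
                    violations.add(rule.call + " at position " + i
                            + " lacks " + rule.companion);
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("Reader.open", Kind.REQUIRES_LATER, "Reader.close"),
                new Rule("Iterator.next", Kind.REQUIRES_EARLIER, "Iterator.hasNext"));
        // Labels are illustrative strings, not real extracted signatures.
        List<String> calls = Arrays.asList("Reader.open", "Reader.read", "Iterator.next");
        check(calls, rules).forEach(System.out::println);
    }
}
```

A real checker would need to respect branches, loops, and exception paths, which is presumably why the paper works on the AST rather than on a flat list.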
Key Findings
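Most compilable LLM answers contain API misuses (57–70% across the evaluated models, Table 2).
Even top models produce large absolute misuse counts; GPT-4's zero-shot overall misuse rate is 62.09% (Table 2).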
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| API Misuse Rate (compilable answers) | 57–70% | — | — | ROBUSTAPI (various models, zero/one-shot) | Range across evaluated models in Table 2 | Table 2 |
| Overall API misuse rate (GPT-4) | 62.09% | — | — | ROBUSTAPI (GPT-4, zero-shot) | GPT-4 zero-shot overall misuse in Table 2 | Table 2 |
What To Try In 7 Days
Run an AST or static API-usage checker on LLM outputs for critical APIs.
Add one correct, short example of the target API to prompts when generating production code.
Treat LLM outputs as draft code: require review, tests, and resource-usage checks before merging.
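The second step above (adding one correct usage example to the prompt) might be wired up along these lines. The prompt wording, the `build` method, and its parameters are assumptions made for illustration; the paper's actual one-shot template is not reproduced here.

```java
class OneShotPromptSketch {
    /**
     * Builds a one-shot prompt: a known-correct usage example of the target API,
     * followed by the actual question. The wording is a hypothetical template.
     */
    static String build(String apiName, String correctExample, String question) {
        return "Here is a correct usage example of " + apiName + ":\n"
                + correctExample.trim() + "\n\n"
                + "Following the same safe usage pattern, answer this question:\n"
                + question.trim() + "\n";
    }

    public static void main(String[] args) {
        String example =
                "Iterator<String> it = list.iterator();\n"
              + "if (it.hasNext()) { String first = it.next(); }";
        System.out.print(build("Iterator.next", example,
                "How do I get the first element of a list in Java?"));
    }
}
```

Keeping the example short and reviewed by a human matters more than the exact phrasing, since a wrong example would teach the model the misuse.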
Risks & Boundaries
Limitations
ROBUSTAPI only covers Java and 18 APIs; results may not generalize to other languages.
Dataset construction filters for questions whose human answers contained misuse, causing selection bias toward problematic cases.
When Not To Use
When evaluating runtime behavior that requires execution-based tests rather than static API rules.
For languages other than Java without ported rule sets.
Failure Modes
Generating non-compilable snippets that need repair.
Omitting exception handling and guard checks.
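As an illustration of the guard-check failure mode, the sketch below contrasts an unguarded `Iterator.next()` call with a guarded one. The class and method names are hypothetical, and the example is not drawn from the ROBUSTAPI rule set.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

class GuardCheckSketch {
    /** Misuse: next() without hasNext() throws NoSuchElementException on empty input. */
    static String firstUnguarded(List<String> items) {
        return items.iterator().next();
    }

    /**
     * Guarded: check hasNext() first and make absence explicit.
     * Note that a null first element is also reported as empty here.
     */
    static Optional<String> firstGuarded(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? Optional.ofNullable(it.next()) : Optional.empty();
    }
}
```

On a non-empty list both methods return the same element, so the misuse only surfaces when the input happens to be empty.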

