Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
LLM-generated code can compile and run yet still be unsafe: API misuse such as a missing check or an unclosed resource can cause crashes, resource leaks, or data loss in production, so teams should verify generated code before deployment.
Summary TLDR
The authors release ROBUSTAPI, a benchmark of 1,208 Stack Overflow Java questions targeting 18 APIs and 41 documented API-usage rules. They build an AST-based checker that detects API misuse (missing checks, wrong call order, missing close, etc.). An evaluation of GPT-3.5, GPT-4, Llama-2, Vicuna-1.5, and code-specialized models shows widespread API misuse: 57–70% of compilable answers contain a misuse, and GPT-4 shows ~62% overall misuse in the zero-shot setting. A one-shot prompt with a correct usage example reduces misuse for some models. The dataset and checker are open-sourced.
Problem Statement
Existing code benchmarks focus on functional correctness (does the code run and produce the expected output) but not on long-term reliability. LLMs can output executable code that misuses APIs (e.g., missing exception handling, not closing resources), which can cause crashes, leaks, or data loss in production. The paper fills this evaluation gap with a dataset and a static checker.
Main Contribution
ROBUSTAPI dataset: 1,208 real Stack Overflow Java questions covering 18 representative APIs and 41 API-usage rules.
An AST-based API-usage checker that detects misuses by comparing extracted call/control sequences to documented rules.
Empirical evaluation of GPT-3.5, GPT-4, Llama-2, Vicuna-1.5 and two DeepSeekCoder variants under zero-shot and one-shot prompts.
Analysis of prompt effects: one-shot relevant examples reduce misuse for some models; irrelevant examples mainly increase compilability.
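The idea behind the checker, comparing an extracted call sequence against usage rules, can be sketched in a few lines. This is a simplified illustration, not the ROBUSTAPI checker: it assumes the call sequence has already been extracted from the AST, and the class name `ApiRuleSketch`, both rule shapes, and the call names used in `main` are hypothetical.

```java
import java.util.List;

// Simplified sketch: checks rules against an already-extracted call sequence.
// Rule shapes and method names are assumptions, not the ROBUSTAPI checker's API.
final class ApiRuleSketch {

    // Guard rule: every `target` call needs a `guard` call since the previous
    // `target` (e.g. hasNext before next). Returns true when the rule is violated.
    static boolean missingGuard(List<String> calls, String guard, String target) {
        boolean guarded = false;
        for (String call : calls) {
            if (call.equals(guard)) {
                guarded = true;
            } else if (call.equals(target)) {
                if (!guarded) {
                    return true;
                }
                guarded = false; // one guard covers one target call
            }
        }
        return false;
    }

    // Follow-up rule: every `opener` must be matched by a later `closer`
    // (e.g. close after opening a stream). Returns true when the rule is violated.
    static boolean missingFollowUp(List<String> calls, String opener, String closer) {
        int pending = 0;
        for (String call : calls) {
            if (call.equals(opener)) {
                pending++;
            } else if (call.equals(closer) && pending > 0) {
                pending--;
            }
        }
        return pending > 0;
    }

    public static void main(String[] args) {
        // Hypothetical extracted sequences for two snippets.
        List<String> unguarded = List.of("iterator", "next");
        List<String> leaked = List.of("open", "read");
        System.out.println(missingGuard(unguarded, "hasNext", "next"));
        System.out.println(missingFollowUp(leaked, "open", "close"));
    }
}
```

A real checker would also need to track control flow (branches, loops, try/catch), which a flat call list like this cannot represent.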
Key Findings
Most compilable LLM answers contain API misuses.
Even the strongest model evaluated, GPT-4, shows roughly 62% overall misuse in the zero-shot setting.
A single correct example in the prompt can reduce misuse for some models.
Irrelevant one-shot examples mainly increase compilability, not correctness.
Neither adjusting the temperature nor supplying formal API rules reliably reduced misuse.
Results
API Misuse Rate (compilable answers)
57–70% across evaluated models
GPT-4 overall misuse
~62% (zero-shot)
Effect of one-shot-relevant (GPT-3.5)
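The metrics reported here can be read as simple ratios over answer counts. The sketch below assumes that the API misuse rate is taken over compilable answers and that the overall misuse percentage is taken over all answers; those definitions are inferred from the metric names, and the class name and the counts in `main` (other than the 1,208 questions) are hypothetical.

```java
// Sketch of the three ratios; the exact definitions are an assumption
// inferred from the metric names, not quoted from the paper.
final class MisuseMetricsSketch {

    // Safe division that returns 0.0 for an empty denominator.
    static double ratio(int part, int whole) {
        return whole == 0 ? 0.0 : (double) part / whole;
    }

    // Share of all answers that compile.
    static double compilationRate(int compilable, int total) {
        return ratio(compilable, total);
    }

    // Share of compilable answers that contain at least one misuse.
    static double apiMisuseRate(int misused, int compilable) {
        return ratio(misused, compilable);
    }

    // Share of all answers that contain at least one misuse.
    static double overallMisuse(int misused, int total) {
        return ratio(misused, total);
    }

    public static void main(String[] args) {
        int total = 1208;      // number of benchmark questions
        int compilable = 800;  // hypothetical count
        int misused = 500;     // hypothetical count
        System.out.printf("compilation rate: %.3f%n", compilationRate(compilable, total));
        System.out.printf("API misuse rate:  %.3f%n", apiMisuseRate(misused, compilable));
        System.out.printf("overall misuse:   %.3f%n", overallMisuse(misused, total));
    }
}
```

Under these definitions, a model that rarely produces compilable code can show a low overall misuse percentage despite a high misuse rate, so the two numbers are best read together.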
Who Should Care
What To Try In 7 Days
Run an AST or static API-usage checker on LLM outputs for critical APIs.
Add one correct, short example of the target API to prompts when generating production code.
Treat LLM outputs as draft code: require review, tests, and resource-usage checks before merging.
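The second step, adding one correct example to the prompt, might look like the sketch below. The prompt wording, the class name `OneShotPromptSketch`, and the sample question are assumptions for illustration and are not the template used in the paper.

```java
// Hypothetical one-shot prompt builder: prepends a known-correct usage
// example of the target API to the user's question.
final class OneShotPromptSketch {

    static String buildPrompt(String apiName, String correctExample, String question) {
        return String.join("\n",
                "Here is a correct usage example of " + apiName + ":",
                correctExample,
                "",
                "Following the same API usage conventions, answer this question in Java:",
                question);
    }

    public static void main(String[] args) {
        // A short, reviewed example that closes the reader via try-with-resources.
        String example = String.join("\n",
                "try (BufferedReader reader = new BufferedReader(new FileReader(path))) {",
                "    String line = reader.readLine();",
                "}");
        String question = "How do I read the first line of a text file?";
        System.out.println(buildPrompt("java.io.BufferedReader", example, question));
    }
}
```

Keeping the example short and drawn from reviewed code matters more than the exact wording, since the findings above suggest that an irrelevant example helps compilability but not correctness.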
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- ROBUSTAPI only covers Java and 18 APIs; results may not generalize to other languages.
- Dataset construction filters for questions whose human answers contained misuse, causing selection bias toward problematic cases.
- Static AST checks may miss runtime-only issues or contextual constraints not encoded in rules.
- Models were evaluated with default hyperparameters; finer tuning or retrieval could change outcomes.
When Not To Use
- When evaluating runtime behavior that requires execution-based tests rather than static API rules.
- For languages other than Java without ported rule sets.
- When measuring purely functional correctness for algorithmic problems (use HumanEval-like benchmarks).
Failure Modes
- Generating non-compilable snippets that need repair.
- Omitting exception handling and guard checks.
- Failing to close resources (files, streams), which causes leaks.
- Incorrect call order or missing required calls for API protocols.
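For contrast with these failure modes, here is a minimal sketch of code that avoids three of them: it closes its resource, handles the checked exception, and guards a call that has a precondition. The APIs shown (`BufferedReader`, `Iterator`) are chosen for illustration and are not necessarily among the 18 covered by the benchmark; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Illustrative correct-usage patterns; names are hypothetical.
final class SafeUsageSketch {

    // try-with-resources closes the reader even if readLine throws,
    // and the IOException is handled instead of being left to crash the caller.
    static Optional<String> firstLine(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return Optional.ofNullable(reader.readLine()); // null means empty input
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    // Guard check: call hasNext() before next() to avoid NoSuchElementException.
    static <T> Optional<T> firstElement(Iterable<T> items) {
        Iterator<T> it = items.iterator();
        return it.hasNext() ? Optional.ofNullable(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstLine(new StringReader("alpha\nbeta")));
        System.out.println(firstElement(List.of()));
    }
}
```

Returning `Optional.empty()` on failure is one design choice among several; production code may prefer to log or rethrow so that I/O errors are not silently swallowed.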
Core Entities
Models
- GPT-3.5
- GPT-4
- Llama-2
- Vicuna-1.5
- ds-coder-6.7b-base
- ds-coder-6.7b-instruct
Metrics
- API Misuse Rate
- Compilation Rate
- Overall API Misuse Percentage
- Pass@k
Datasets
- ROBUSTAPI
- ExampleCheck (source dataset)
Benchmarks
- ROBUSTAPI
Context Entities
Models
- Copilot
- Code-specialized LLMs (general mention)
Datasets
- HumanEval
- Stack Overflow (source of questions)

