LLMs often produce executable but unsafe Java code: GPT-4 showed ~62% API misuse on Stack Overflow questions.

Overview

Decision Snapshot: Needs Validation

The paper provides an open dataset and clear AST-based checks with multi-model experiments, making findings actionable for engineering teams.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 50%

Authors

Li Zhong, Zilong Wang

Why It Matters For Business

LLM snippets can run but still be unsafe: unchecked API misuse can cause crashes, leaks, or data loss if pushed to production, so teams must verify LLM code before deployment.

Who Should Care

CTO, Engineering Lead, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors release ROBUSTAPI, a benchmark of 1,208 Stack Overflow Java questions targeting 18 APIs and 41 documented API-usage rules. They build an AST-based checker to detect API misuse (missing checks, wrong call order, missing close, etc.). Evaluating GPT-3.5, GPT-4, Llama-2, Vicuna-1.5 and code-specialized models shows widespread API misuses: among compilable answers 57–70% contain misuse; GPT-4 shows ~62% overall misuse in zero-shot. A one-shot prompt with a correct usage example reduces misuse for some models. The dataset and checker are open-sourced.

Problem Statement

Existing code benchmarks focus on functional correctness (does code run) but not on long-term reliability. LLMs can output executable code that misuses APIs (e.g., missing exception handling, not closing resources), which can cause crashes, leaks, or data loss in production. The paper fills this evaluation gap with a dataset and static checker.
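The gap between "runs" and "reliable" can be illustrated with a minimal sketch. The class and method names below are hypothetical and the snippet is not taken from the paper or its dataset; it only contrasts a reader that is never closed with a try-with-resources version that also handles the `IOException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical illustration, not an item from ROBUSTAPI.
final class ResourceUsageDemo {

    // Misuse: works on the happy path, but the reader is never closed
    // and any IOException is pushed onto the caller.
    static String firstLineUnsafe(String text) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        return reader.readLine();
    }

    // Safer: try-with-resources closes the reader on every path,
    // and the exception and the null (end-of-input) case are handled.
    static String firstLineSafe(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLineSafe("alpha\nbeta"));
    }
}
```

Both methods return the same value for ordinary input, which is why a functional-correctness test alone would not tell them apart.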

Main Contribution

ROBUSTAPI dataset: 1,208 real Stack Overflow Java questions covering 18 representative APIs and 41 API-usage rules.

An AST-based API-usage checker that detects misuses by comparing extracted call/control sequences to documented rules.
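The rule-matching idea can be sketched roughly as follows. This is a simplified illustration, not the authors' checker: it assumes the call sequence has already been extracted from the AST as a flat list of names, it ignores control flow, and the two rules and all identifiers (`CallOrderChecker`, `Rule`, `exampleRules`) are hypothetical rather than taken from the 41 documented rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: checks a pre-extracted call sequence against ordering rules.
final class CallOrderChecker {

    enum Kind { MUST_PRECEDE, MUST_FOLLOW }

    // "required" must appear before (or after) every occurrence of "target".
    static final class Rule {
        final String target;
        final String required;
        final Kind kind;

        Rule(String target, String required, Kind kind) {
            this.target = target;
            this.required = required;
            this.kind = kind;
        }
    }

    // Illustrative rules only (assumed, not from the paper).
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("Iterator.next", "Iterator.hasNext", Kind.MUST_PRECEDE),
            new Rule("FileInputStream.new", "FileInputStream.close", Kind.MUST_FOLLOW));
    }

    // Returns one message per violated rule occurrence; empty means no misuse found.
    static List<String> check(List<String> calls, List<Rule> rules) {
        List<String> violations = new ArrayList<>();
        for (Rule rule : rules) {
            for (int i = 0; i < calls.size(); i++) {
                if (!calls.get(i).equals(rule.target)) {
                    continue;
                }
                boolean before = rule.kind == Kind.MUST_PRECEDE;
                List<String> scope = before
                    ? calls.subList(0, i)
                    : calls.subList(i + 1, calls.size());
                if (!scope.contains(rule.required)) {
                    violations.add(rule.required + " missing "
                        + (before ? "before " : "after ") + rule.target);
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> calls = Arrays.asList("FileInputStream.new", "FileInputStream.read");
        System.out.println(check(calls, exampleRules()));
    }
}
```

A real checker would need a Java parser to build the call and control sequences and would have to treat branches and exception paths separately; the flat list here only shows the comparison step.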

Key Findings

Most compilable LLM answers contain API misuses.

Numbers: 57–70% misuse among compilable answers (evaluated models, zero/one-shot)

Practical Use: Treat LLM-generated code as risky by default and add automated API-usage checks before deploying.

Evidence Ref: Table 2; Figure 3

Even top models show high misuse rates.

Numbers: GPT-4 overall misuse ~62.09% (zero-shot)

Practical Use: Don’t assume newer-generation LLMs eliminate API misuse; verify outputs with static analyzers.

Evidence Ref: Table 2 (GPT-4, zero-shot)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
API Misuse Rate (compilable answers)	57–70%	—	—	ROBUSTAPI (various models, zero/one-shot)	Range across evaluated models in Table 2	Table 2
GPT-4 overall misuse	62.09% overall misuse	—	—	ROBUSTAPI (GPT-4, zero-shot)	GPT-4 zero-shot overall misuse in Table 2	Table 2
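
How the rate metrics relate to each other can be sketched as below. The definitions are an assumption based on the metric names on this page (misuse rate taken over compilable answers, overall misuse taken over all answers), and the counts in `main` are made-up numbers, not results from the paper.

```java
// Sketch of the three rate metrics under assumed definitions; all values in percent.
final class MisuseMetrics {

    // Share of compilable answers that contain at least one API misuse.
    static double misuseRate(int misuse, int compilable) {
        return compilable == 0 ? 0.0 : 100.0 * misuse / compilable;
    }

    // Share of all answers that compile.
    static double compilationRate(int compilable, int total) {
        return total == 0 ? 0.0 : 100.0 * compilable / total;
    }

    // Share of all answers that compile and contain a misuse.
    static double overallMisuse(int misuse, int total) {
        return total == 0 ? 0.0 : 100.0 * misuse / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only.
        int total = 1000, compilable = 800, misuse = 500;
        System.out.println(misuseRate(misuse, compilable));
        System.out.println(compilationRate(compilable, total));
        System.out.println(overallMisuse(misuse, total));
    }
}
```

Under these assumed definitions the overall percentage equals the misuse rate multiplied by the compilation rate, so a model can lower its overall figure simply by compiling less often, which is worth keeping in mind when comparing rows.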

What To Try In 7 Days

Run an AST or static API-usage checker on LLM outputs for critical APIs.

Add one correct, short example of the target API to prompts when generating production code.

Treat LLM outputs as draft code: require review, tests, and resource-usage checks before merging.
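
The one-shot suggestion above could be wired into a prompt along these lines. The template wording and the `OneShotPrompt` helper are hypothetical; the paper's actual prompt format is not reproduced here.

```java
// Sketch of a one-shot prompt builder; the wording is an assumption, not the paper's template.
final class OneShotPrompt {

    // Places a short, known-correct usage example ahead of the actual question.
    static String build(String apiName, String correctExample, String question) {
        return "Here is a correct usage example of " + apiName + ":\n"
            + correctExample + "\n\n"
            + "Following the same usage pattern, answer this question with Java code:\n"
            + question;
    }

    public static void main(String[] args) {
        String example = "try (FileInputStream in = new FileInputStream(path)) { in.read(); }";
        System.out.println(build("java.io.FileInputStream", example,
            "How do I read the first byte of a file?"));
    }
}
```

The example should come from vetted code for the specific API, since a demonstration that itself misuses the API would likely be copied by the model.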

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FloridSleeves/RobustAPI

Data URLs

https://github.com/FloridSleeves/RobustAPI

Risks & Boundaries

Limitations

ROBUSTAPI only covers Java and 18 APIs; results may not generalize to other languages.

Dataset construction filters for questions whose human answers contained misuse, causing selection bias toward problematic cases.

When Not To Use

When evaluating runtime behavior that requires execution-based tests rather than static API rules.

For languages other than Java without ported rule sets.

Failure Modes

Generating non-compilable snippets that need repair.

Omitting exception handling and guard checks.
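
A missing guard check typically looks like the sketch below. The names are hypothetical and the snippet is an illustration rather than an item from the benchmark: the unsafe variant calls `next()` without `hasNext()`, while the guarded variants check the iterator and the possibly null result of `Map.get`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of guard checks around Iterator.next and Map.get.
final class GuardCheckDemo {

    // Misuse: throws NoSuchElementException when the list is empty.
    static String firstUnsafe(List<String> items) {
        return items.iterator().next();
    }

    // Guarded: hasNext() is checked before next().
    static Optional<String> firstSafe(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? Optional.ofNullable(it.next()) : Optional.empty();
    }

    // Guarded: Map.get may return null for a missing key.
    static int valueLength(Map<String, String> map, String key) {
        String value = map.get(key);
        return value == null ? 0 : value.length();
    }

    public static void main(String[] args) {
        System.out.println(firstSafe(Collections.<String>emptyList()).isPresent());
        System.out.println(firstSafe(Arrays.asList("a", "b")).get());
    }
}
```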

Core Entities

Models

GPT-3.5, GPT-4, Llama-2, Vicuna-1.5, ds-coder-6.7b-base, ds-coder-6.7b-instruct

Metrics

API Misuse Rate, Compilation Rate, Overall API Misuse Percentage, Pass@k

Datasets

ROBUSTAPI, ExampleCheck (source dataset)

Benchmarks

ROBUSTAPI

Context Entities

Models

Copilot, Code-specialized LLMs (general mention)

Datasets

HumanEval, Stack Overflow (source of questions)
