LLMs often produce executable but unsafe Java code: GPT-4 showed ~62% API misuse on Stack Overflow questions.

August 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides an open dataset and clear AST-based checks with multi-model experiments, making findings actionable for engineering teams.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 50%

Authors

Li Zhong, Zilong Wang

Why It Matters For Business

LLM snippets can run but still be unsafe: unchecked API misuse can cause crashes, leaks, or data loss if pushed to production, so teams must verify LLM code before deployment.

Who Should Care

Summary TLDR

The authors release ROBUSTAPI, a benchmark of 1,208 Stack Overflow Java questions targeting 18 APIs and 41 documented API-usage rules. They build an AST-based checker to detect API misuse (missing checks, wrong call order, missing close, etc.). Evaluating GPT-3.5, GPT-4, Llama-2, Vicuna-1.5 and code-specialized models shows widespread API misuses: among compilable answers 57–70% contain misuse; GPT-4 shows ~62% overall misuse in zero-shot. A one-shot prompt with a correct usage example reduces misuse for some models. The dataset and checker are open-sourced.

Problem Statement

Existing code benchmarks focus on functional correctness (does code run) but not on long-term reliability. LLMs can output executable code that misuses APIs (e.g., missing exception handling, not closing resources), which can cause crashes, leaks, or data loss in production. The paper fills this evaluation gap with a dataset and static checker.
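To make the gap concrete, here is a minimal sketch of a snippet that runs correctly yet misuses an API, next to a safer variant. The class and method names are hypothetical and chosen only for illustration; they do not come from the paper or the dataset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ResourceUsageExample {

    // Misuse: works on the happy path, but the reader is never closed,
    // so the resource leaks (and stays open if readLine() throws).
    static String firstLineUnsafe(String text) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        return reader.readLine();
    }

    // Safer: try-with-resources closes the reader on every path,
    // and the IOException is handled instead of being ignored.
    static String firstLineSafe(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line = reader.readLine();
            return line == null ? "" : line; // guard against empty input
        } catch (IOException e) {
            return "";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLineUnsafe("hello\nworld"));
        System.out.println(firstLineSafe("hello\nworld"));
    }
}
```

Both methods return the same value for well-formed input, which is why a benchmark that only checks functional correctness would not distinguish them.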

Main Contribution

ROBUSTAPI dataset: 1,208 real Stack Overflow Java questions covering 18 representative APIs and 41 API-usage rules.

An AST-based API-usage checker that detects misuses by comparing extracted call/control sequences to documented rules.
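A minimal sketch of what rule checking over an extracted call sequence might look like, assuming the AST traversal has already reduced a snippet to an ordered list of call names. The `Rule` class, its fields, and the call-name strings are hypothetical and are not the paper's checker; control structure (branches, try/catch) is ignored here for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CallSequenceChecker {

    // Hypothetical rule shape (not the paper's format): `target` must be
    // preceded by `requiredBefore` and/or followed by `requiredAfter`.
    // A null field means "no requirement".
    static final class Rule {
        final String target;
        final String requiredBefore;
        final String requiredAfter;

        Rule(String target, String requiredBefore, String requiredAfter) {
            this.target = target;
            this.requiredBefore = requiredBefore;
            this.requiredAfter = requiredAfter;
        }
    }

    // Returns one message per violated rule occurrence in the call sequence.
    static List<String> check(List<String> calls, List<Rule> rules) {
        List<String> violations = new ArrayList<>();
        for (Rule rule : rules) {
            for (int i = 0; i < calls.size(); i++) {
                if (!calls.get(i).equals(rule.target)) {
                    continue;
                }
                List<String> before = calls.subList(0, i);
                List<String> after = calls.subList(i + 1, calls.size());
                if (rule.requiredBefore != null && !before.contains(rule.requiredBefore)) {
                    violations.add(rule.target + " without prior " + rule.requiredBefore);
                }
                if (rule.requiredAfter != null && !after.contains(rule.requiredAfter)) {
                    violations.add(rule.target + " without later " + rule.requiredAfter);
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("Iterator.next", "Iterator.hasNext", null),
                new Rule("FileInputStream.new", null, "FileInputStream.close"));
        // Call sequence as it might be extracted from a generated snippet.
        List<String> calls = Arrays.asList(
                "FileInputStream.new", "FileInputStream.read", "Iterator.next");
        for (String violation : check(calls, rules)) {
            System.out.println(violation);
        }
    }
}
```

In this example the sequence violates both rules: `Iterator.next` has no earlier `Iterator.hasNext`, and the stream is opened but never closed.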

Key Findings

Most compilable LLM answers contain API misuses.

Numbers: 57–70% misuse among compilable answers (evaluated models, zero/one-shot)

Practical Use: Treat LLM-generated code as risky by default and add automated API-usage checks before deploying.

Evidence Ref: Table 2; Figure 3

Even top models show high overall misuse rates.

Numbers: GPT-4 overall misuse 62.09% (zero-shot)

Practical Use: Don't assume newer-generation LLMs eliminate API misuse; verify outputs with static analyzers.

Evidence Ref: Table 2 (GPT-4, zero-shot)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| API Misuse Rate (compilable answers) | 57–70% | - | - | ROBUSTAPI (various models, zero/one-shot) | Range across evaluated models in Table 2 | Table 2 |
| GPT-4 overall misuse | 62.09% | - | - | ROBUSTAPI (GPT-4, zero-shot) | GPT-4 zero-shot overall misuse in Table 2 | Table 2 |
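The two rows use different denominators, which is easy to miss. The sketch below shows one plausible way the metrics relate, assuming (from the metric names alone, not from formulas quoted in this summary) that the misuse rate is taken over compilable answers and the overall misuse percentage over all answers. The counts are hypothetical and are not results from the paper.

```java
class MisuseMetrics {

    // Share of compilable answers that contain at least one API misuse.
    static double misuseRate(int misuseCount, int compilableCount) {
        return compilableCount == 0 ? 0.0 : (double) misuseCount / compilableCount;
    }

    // Share of all answers that compile.
    static double compilationRate(int compilableCount, int totalCount) {
        return totalCount == 0 ? 0.0 : (double) compilableCount / totalCount;
    }

    // Share of all answers (compilable or not) that contain a misuse.
    static double overallMisuse(int misuseCount, int totalCount) {
        return totalCount == 0 ? 0.0 : (double) misuseCount / totalCount;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only.
        int total = 1000;
        int compilable = 800;
        int misuse = 500;
        System.out.printf("misuse rate (compilable): %.3f%n", misuseRate(misuse, compilable));
        System.out.printf("compilation rate:         %.3f%n", compilationRate(compilable, total));
        System.out.printf("overall misuse:           %.3f%n", overallMisuse(misuse, total));
    }
}
```

Under these assumed definitions, a model that compiles more often can show a higher overall misuse percentage even when its misuse rate among compilable answers is unchanged.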

What To Try In 7 Days

Run an AST or static API-usage checker on LLM outputs for critical APIs.

Add one correct, short example of the target API to prompts when generating production code.

Treat LLM outputs as draft code: require review, tests, and resource-usage checks before merging.
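The second step above could be wired in with a small helper like the one below. This is a sketch under stated assumptions: the `OneShotPrompt` class and the template wording are hypothetical, and the paper's exact one-shot prompt is not reproduced here.

```java
class OneShotPrompt {

    // Builds a prompt that shows one correct usage example before the question.
    // The template text is an illustrative assumption, not the paper's prompt.
    static String build(String apiName, String correctExample, String question) {
        return "Correct usage example for " + apiName + ":\n"
                + correctExample + "\n\n"
                + "Answer the following with Java code that applies the same "
                + "guard checks, call order, and cleanup:\n"
                + question + "\n";
    }

    public static void main(String[] args) {
        String example =
                "Iterator<String> it = items.iterator();\n"
                + "if (it.hasNext()) {\n"
                + "    String first = it.next();\n"
                + "}";
        String question = "How do I read the first element of a List<String>?";
        System.out.println(build("java.util.Iterator", example, question));
    }
}
```

Keeping the example short and specific to the target API matters here, since the summary reports that one-shot prompting reduced misuse only for some models.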

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

ROBUSTAPI only covers Java and 18 APIs; results may not generalize to other languages.

Dataset construction filters for questions whose human answers contained misuse, causing selection bias toward problematic cases.

When Not To Use

When evaluating runtime behavior that requires execution-based tests rather than static API rules.

For languages other than Java without ported rule sets.

Failure Modes

Generating non-compilable snippets that need repair.

Omitting exception handling and guard checks.
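The second failure mode can be illustrated with two common cases: calling `Iterator.next()` without a `hasNext()` guard, and calling `Integer.parseInt` without handling `NumberFormatException`. This is a minimal sketch; the class and method names are hypothetical and the cases are chosen for illustration rather than taken from the benchmark's rule list.

```java
import java.util.Iterator;
import java.util.List;

class GuardCheckExample {

    // Misuse: next() throws NoSuchElementException when the list is empty.
    static String firstUnsafe(List<String> items) {
        return items.iterator().next();
    }

    // Guarded: hasNext() is checked before next().
    static String firstSafe(List<String> items, String fallback) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? it.next() : fallback;
    }

    // Handles both the null input and the NumberFormatException that
    // Integer.parseInt throws on non-numeric text.
    static int parseOrDefault(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstSafe(List.of(), "none"));
        System.out.println(parseOrDefault("abc", -1));
        System.out.println(parseOrDefault(" 42 ", -1));
    }
}
```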

Core Entities

Models

GPT-3.5, GPT-4, Llama-2, Vicuna-1.5, ds-coder-6.7b-base, ds-coder-6.7b-instruct

Metrics

API Misuse Rate, Compilation Rate, Overall API Misuse Percentage, Pass@k

Datasets

ROBUSTAPI, ExampleCheck (source dataset)

Benchmarks

ROBUSTAPI

Context Entities

Models

Copilot, code-specialized LLMs (general mention)

Datasets

HumanEval, Stack Overflow (source of questions)