LLMs often produce executable but unsafe Java code: GPT-4 showed ~62% overall API misuse on Stack Overflow-style questions.

August 20, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Li Zhong, Zilong Wang

Links

Abstract / PDF

Why It Matters For Business

LLM snippets can run but still be unsafe: unchecked API misuse can cause crashes, leaks, or data loss if pushed to production, so teams must verify LLM code before deployment.

Summary TLDR

The authors release ROBUSTAPI, a benchmark of 1,208 Stack Overflow Java questions targeting 18 APIs and 41 documented API-usage rules. They build an AST-based checker to detect API misuse (missing checks, wrong call order, missing close, etc.). Evaluating GPT-3.5, GPT-4, Llama-2, Vicuna-1.5 and code-specialized models shows widespread API misuses: among compilable answers 57–70% contain misuse; GPT-4 shows ~62% overall misuse in zero-shot. A one-shot prompt with a correct usage example reduces misuse for some models. The dataset and checker are open-sourced.

Problem Statement

Existing code benchmarks focus on functional correctness (does the code run and produce the expected result) but not on long-term reliability. LLMs can output executable code that misuses APIs (e.g., missing exception handling, not closing resources), which can cause crashes, leaks, or data loss in production. The paper fills this evaluation gap with a dataset and a static checker.

Main Contribution

ROBUSTAPI dataset: 1,208 real Stack Overflow Java questions covering 18 representative APIs and 41 API-usage rules.

An AST-based API-usage checker that detects misuses by comparing extracted call/control sequences to documented rules.

Empirical evaluation of GPT-3.5, GPT-4, Llama-2, Vicuna-1.5 and two DeepSeekCoder variants under zero-shot and one-shot prompts.

Analysis of prompt effects: one-shot relevant examples reduce misuse for some models; irrelevant examples mainly increase compilability.
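The rule-matching idea behind the checker can be illustrated with a minimal sketch. It assumes the call sequence has already been extracted from the Java AST into a flat list of names; the `Rule` class, the `find_misuses` helper, and both rules are hypothetical illustrations, not the authors' implementation or entries from the ROBUSTAPI rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical usage rule: `api` needs `needs` before or after it."""
    api: str       # the call the rule is about
    needs: str     # companion call that must also appear
    position: str  # "before" or "after", relative to `api`


def find_misuses(calls, rules):
    """Return the rules violated by an ordered list of call names."""
    violations = []
    for rule in rules:
        for i, name in enumerate(calls):
            if name != rule.api:
                continue
            # Look at the calls on the required side of the API call.
            window = calls[:i] if rule.position == "before" else calls[i + 1:]
            if rule.needs not in window:
                violations.append(rule)
                break  # report each rule at most once
    return violations


# Illustrative rules only (assumed, not taken from the dataset).
RULES = [
    Rule("Iterator.next", "Iterator.hasNext", "before"),
    Rule("FileInputStream.<init>", "FileInputStream.close", "after"),
]

leaky = ["FileInputStream.<init>", "FileInputStream.read"]
safe = ["FileInputStream.<init>", "FileInputStream.read", "FileInputStream.close"]
unguarded = ["List.iterator", "Iterator.next"]

for label, calls in [("leaky", leaky), ("safe", safe), ("unguarded", unguarded)]:
    print(label, [rule.api for rule in find_misuses(calls, RULES)])
```

A flat sequence ignores control flow, so this sketch cannot tell whether a guard such as `hasNext` actually dominates the call it protects; a real checker would need the control structure the summary mentions.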

Key Findings

Most compilable LLM answers contain API misuses.

Numbers: 57–70% misuse among compilable answers (evaluated models, zero/one-shot)

Even top models show high overall misuse rates.

Numbers: GPT-4 overall misuse ~62.09% (zero-shot)

A single correct example in the prompt can reduce misuse for some models.

Numbers: GPT-3.5 misuse drops from 62.97% to 38.56% (zero-shot → one-shot-relevant)

Irrelevant one-shot examples mainly increase compilability, not correctness.

Numbers: GPT-3.5 compilable rate rises 79.14% → 91.06% with irrelevant shot; misuse rate does not improve

Temperature and supplying formal API rules did not reliably reduce misuse.

Numbers: Changing the temperature (T) or adding rules did not lower misuse vs one-shot-relevant (see GPT-3.5 experiments)

Results

API Misuse Rate (compilable answers)

Value: 57–70%

GPT-4 overall misuse

Value: 62.09% overall misuse

Effect of one-shot-relevant (GPT-3.5)

Value: Misuse drops 62.97% → 38.56%

Baseline: Zero-shot GPT-3.5 misuse 62.97%

Who Should Care

What To Try In 7 Days

Run an AST or static API-usage checker on LLM outputs for critical APIs.

Add one correct, short example of the target API to prompts when generating production code.

Treat LLM outputs as draft code: require review, tests, and resource-usage checks before merging.
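The second suggestion above, adding one correct example to the prompt, might look like the sketch below. The prompt wording, the `build_one_shot_prompt` helper, and the example question are assumptions made for illustration; they are not the prompt template used in the paper. The embedded Java snippet shows the kind of usage the example should model: resources closed via try-with-resources, a null check on the read result, and the checked exception handled.

```python
# Hypothetical example pair; swap in a correct snippet for the API you target.
EXAMPLE_QUESTION = "How do I read the first line of a text file in Java?"
EXAMPLE_ANSWER = """\
try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
    String line = reader.readLine();
    if (line != null) {
        System.out.println(line);
    }
} catch (IOException e) {
    e.printStackTrace();
}"""


def build_one_shot_prompt(question,
                          example_question=EXAMPLE_QUESTION,
                          example_answer=EXAMPLE_ANSWER):
    """Prefix the real question with one worked, correct-usage example."""
    return (
        "Answer the Java question with a short code snippet.\n\n"
        f"Question: {example_question}\n"
        f"Answer:\n{example_answer}\n\n"
        f"Question: {question}\n"
        "Answer:\n"
    )


if __name__ == "__main__":
    print(build_one_shot_prompt("How do I append a line to an existing file?"))
```

Per the findings above, the example should use the same API as the question; an unrelated example mainly helped compilability rather than correctness.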

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • ROBUSTAPI only covers Java and 18 APIs; results may not generalize to other languages.
  • Dataset construction filters for questions whose human answers contained misuse, causing selection bias toward problematic cases.
  • Static AST checks may miss runtime-only issues or contextual constraints not encoded in rules.
  • Models were evaluated with default hyperparameters; finer tuning or retrieval could change outcomes.

When Not To Use

  • When evaluating runtime behavior that requires execution-based tests rather than static API rules.
  • For languages other than Java without ported rule sets.
  • When measuring purely functional correctness for algorithmic problems (use HumanEval-like benchmarks).

Failure Modes

  • Generating non-compilable snippets that need repair.
  • Omitting exception handling and guard checks.
  • Failing to close resources (files, streams) causing leaks.
  • Incorrect call order or missing required calls for API protocols.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Llama-2
  • Vicuna-1.5
  • ds-coder-6.7b-base
  • ds-coder-6.7b-instruct

Metrics

  • API Misuse Rate
  • Compilation Rate
  • Overall API Misuse Percentage
  • Pass@k
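The first three metrics differ mainly in their denominators. The sketch below assumes the definitions implied by this summary: API Misuse Rate is measured over compilable answers, while the overall percentage is measured over all answers. The function name and the counts are made up for illustration and are not results from the paper.

```python
def misuse_metrics(n_total, n_compilable, n_misuse):
    """Compute the three rates from raw answer counts (assumed definitions)."""
    if n_total <= 0 or not (0 <= n_misuse <= n_compilable <= n_total):
        raise ValueError("expected 0 <= n_misuse <= n_compilable <= n_total, n_total > 0")
    return {
        # Share of all answers that compile.
        "compilation_rate": n_compilable / n_total,
        # Share of compilable answers flagged by the checker.
        "api_misuse_rate": n_misuse / n_compilable if n_compilable else 0.0,
        # Share of all answers flagged by the checker.
        "overall_misuse": n_misuse / n_total,
    }


# Made-up counts: 1000 answers, 800 compile, 500 of those are flagged.
metrics = misuse_metrics(n_total=1000, n_compilable=800, n_misuse=500)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these definitions a model can lower its overall misuse percentage simply by producing fewer compilable answers, which is why the two misuse numbers are worth reading together with the compilation rate.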

Datasets

  • ROBUSTAPI
  • ExampleCheck (source dataset)

Benchmarks

  • ROBUSTAPI

Context Entities

Models

  • Copilot
  • Code-specialized LLMs (general mention)

Datasets

  • HumanEval
  • Stack Overflow (source of questions)