Taxonomy and lightweight mitigation of hallucinations in LLM-generated code

April 1, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The study uses manual labels and standard benchmarks; taxonomy and mitigation experiments are reproducible but limited to selected models and datasets.

Citations: 30

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Fang Liu, Yang Liu, Lin Shi, Zhen Yang, Li Zhang, Xiaoli Lian, Zhongqi Li, Yuchi Ma

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.

Who Should Care

Summary TLDR

The paper builds a practical taxonomy of "code hallucinations": explicit semantic conflicts between generated code and stated facts (requirements, surrounding code, or real-world knowledge). The authors manually analyze 3,120 LLM outputs from HumanEval and CoderEval and find 1,134 samples containing hallucinations (1,212 hallucinatory snippets in total). They show that Requirement/Behavior conflicts are the most common category, trace causes to model limits and prompt gaps, and quantify the impact (95% lead to incorrect functionality). They also demonstrate that prompt-based fixes (Self-Refine, Chain-of-Thought, and RAG) can reduce targeted hallucinations but can also lower pass@1 for smaller models.

Problem Statement

LLM code generators often produce outputs that break explicit facts from requirements, project context, or real-world knowledge. Existing work focuses on generic code errors or natural-language hallucinations but lacks a focused, evidence-backed taxonomy, distributional analysis, cause/impact study, and practical mitigation guidance for hallucinations specific to code generation.

Main Contribution

A reproducible, manually curated taxonomy of code hallucinations (3 primary categories, 12 leaf types).

A statistical study of 3,120 LLM-generated programs across HumanEval and CoderEval, reporting prevalence, model differences, and impact on correctness.

Key Findings

Code hallucinations occur often: 1,134 of 3,120 samples contained hallucinations.

Numbers: 1,134/3,120 samples; 1,212 hallucinatory snippets

Practical Use: Expect hallucinations in roughly one third of generated solutions; add checks and human review when using LLMs for production code.

Evidence Ref: Section IV-A; taxonomy construction

Three top-level hallucination shares: Requirement Conflicting 39.60%, Knowledge 34.90%, Code Inconsistency 25.50%.

Numbers: Requirement: 39.60%; Knowledge: 34.90%; Inconsistency: 25.50%

Practical Use: Focus first on requirement-following and project/context knowledge when validating generated code.

Evidence Ref: Section IV-B (taxonomy percentages)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Total generated samples | 3,120 | - | - | HumanEval + CoderEval | - | Section IV-A; Table I
Hallucinatory samples | 1,134 (36.35%) | - | - | All datasets | Manual labeling: 1,134 samples contained hallucination | Section IV-A
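
The prevalence figure can be re-derived from the reported counts. The short Python sketch below only uses the three numbers stated in this summary; the variable names are illustrative and not taken from the paper.

```python
# Counts as reported in this summary (Section IV-A of the paper).
total_samples = 3120
hallucinatory_samples = 1134
hallucinatory_snippets = 1212

# Share of generated samples that contain at least one hallucination.
prevalence = hallucinatory_samples / total_samples

# Average number of hallucinatory snippets per affected sample.
snippets_per_sample = hallucinatory_snippets / hallucinatory_samples

print(f"prevalence: {prevalence:.2%}")
print(f"snippets per hallucinatory sample: {snippets_per_sample:.2f}")
```

The second ratio is slightly above 1, which suggests that most affected samples contain a single hallucinatory snippet.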

What To Try In 7 Days

Add a lightweight hallucination checklist: validate requirement alignment, undefined vars, and library calls on every generated snippet.

Use retrieval augmentation (RAG) to inject project files or APIs before generation for repo-level tasks.

Filter LLM outputs by a simple heuristic: reject outputs that contradict explicit requirement strings or reference unknown APIs; run unit tests early.
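
As a starting point for the checklist and filter ideas above, here is a minimal Python sketch of a static pre-check for generated Python snippets. It is an assumption-laden illustration, not the paper's tooling: the function names (`unresolved_imports`, `undefined_names`, `check_snippet`) are hypothetical, the undefined-name check ignores scoping and is deliberately coarse, and "unknown API" is approximated as "top-level module not importable in the current environment". Requirement alignment is not covered here and still needs unit tests or human review.

```python
import ast
import builtins
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Top-level modules imported by the snippet that this environment lacks."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # relative imports are skipped in this sketch
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            try:
                found = importlib.util.find_spec(top) is not None
            except (ImportError, ValueError):
                found = False
            if not found and top not in missing:
                missing.append(top)
    return missing


def undefined_names(source: str) -> list[str]:
    """Names that are read but never bound anywhere in the snippet.

    Coarse on purpose: scopes are ignored, so a name bound anywhere
    in the snippet counts as defined everywhere.
    """
    tree = ast.parse(source)
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, (ast.Store, ast.Del)):
            defined.add(node.id)
    loaded = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    return sorted(loaded - defined)


def check_snippet(source: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means 'passed'."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = [f"undefined name: {name}" for name in undefined_names(source)]
    problems += [f"unresolved import: {mod}" for mod in unresolved_imports(source)]
    return problems


# Hypothetical generated snippet with one undefined variable and one unknown library.
snippet = """
import not_a_real_module_xyz

def mean(values):
    return total / len(values)
"""
print(check_snippet(snippet))
```

A snippet that returns a non-empty list could be rejected or sent for review before any unit tests run, which keeps the expensive test step for outputs that pass the cheap checks.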

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HumanEval (public), CoderEval (public)

Risks & Boundaries

Limitations

Study limited to Python and Java tasks from HumanEval and CoderEval.

Manual annotation is subjective despite strong inter-annotator agreement.

When Not To Use

As a sole quality metric for production release: hallucination detection complements unit tests rather than replacing them.

For languages or domains outside Python/Java without prior validation: the taxonomy distribution may differ.

Failure Modes

Prompt-enhancement (CoT or Self-Refine) can lower pass@1 on smaller models due to error propagation.

Subtle 'Inconsistent Libraries' hallucinations may pass unit tests locally but cause runtime failures in different environments.

Core Entities

Models

CodeLlama-7B, DeepSeek-Coder-1.3B, DeepSeek-Coder-7B, DeepSeek-R1 (671B), GPT-4 (gpt-4-0125-preview)

Metrics

pass@1, hallucination prevalence (%), hallucinatory-snippet count
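
For readers unfamiliar with pass@1, the sketch below shows the widely used unbiased pass@k estimator that was introduced alongside HumanEval. This summary does not state how many samples per task the study generated or which estimator it used, so treat the function as background under that assumption; `pass_at_k` and its parameter names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass all unit tests
    k: budget of attempts being evaluated
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimator reduces to the plain pass rate c / n.
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level score is the mean of this value over all tasks.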

Datasets

HumanEval, CoderEval

Benchmarks

HumanEval, CoderEval

Context Entities

Metrics

Cohen's Kappa (annotation agreement)

Datasets

HumanEval (164 Python tasks), CoderEval (230 Python + 230 Java functions)

Benchmarks

HumanEval, CoderEval