Taxonomy and lightweight mitigation of hallucinations in LLM-generated code

Overview

Decision Snapshot: Ready For Pilot

The study uses manual labels and standard benchmarks; taxonomy and mitigation experiments are reproducible but limited to selected models and datasets.

Citations: 30

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Fang Liu, Yang Liu, Lin Shi, Zhen Yang, Li Zhang, Xiaoli Lian, Zhongqi Li, Yuchi Ma

Why It Matters For Business

Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Data Scientist

Summary TLDR

The paper builds a practical taxonomy of "code hallucinations": explicit semantic conflicts between generated code and stated facts (requirements, surrounding code, or real-world knowledge). The authors manually analyze 3,120 LLM outputs from HumanEval and CoderEval and find 1,134 samples containing hallucinations (1,212 hallucinatory snippets in total). Requirement/Behavior conflicts are the most common type. The study traces causes to model limits and prompt gaps and quantifies impact (95% lead to incorrect functionality). Prompt-based fixes (Self-Refine, Chain-of-Thought, and RAG) can reduce the hallucinations they target, but they can also reduce pass@1 for smaller models.

Problem Statement

LLM code generators often produce outputs that break explicit facts from requirements, project context, or real-world knowledge. Existing work focuses on generic code errors or natural-language hallucinations but lacks a focused, evidence-backed taxonomy, distributional analysis, cause/impact study, and practical mitigation guidance for hallucinations specific to code generation.

Main Contribution

A reproducible, manually curated taxonomy of code hallucinations (3 primary categories, 12 leaf types).

A statistical study of 3,120 LLM-generated programs across HumanEval and CoderEval, reporting prevalence, model differences, and impact on correctness.

Key Findings

Code hallucinations occur often: 1,134 of 3,120 samples contained hallucinations.

Numbers: 1,134/3,120 samples (36.35%); 1,212 hallucinatory snippets

Practical Use: Expect hallucinations in more than one third of generated solutions; add checks and human review when using LLMs for production code.

Evidence Ref: Section IV-A; taxonomy construction

Three top-level hallucination shares: Requirement Conflicting 39.60%, Knowledge 34.90%, Code Inconsistency 25.50%.

Numbers: Requirement: 39.60%; Knowledge: 34.90%; Inconsistency: 25.50%

Practical Use: Focus first on requirement-following and project/context knowledge when validating generated code.

Evidence Ref: Section IV-B (taxonomy percentages)
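
The two findings above can be cross-checked with a little arithmetic. The sketch below recomputes the prevalence from the reported counts and converts the three top-level shares into approximate snippet counts. It assumes the shares are fractions of the 1,212 hallucinatory snippets rather than of the 1,134 samples; the summary does not state the denominator, so treat the implied counts as an estimate.

```python
# Counts reported in the summary above.
total_samples = 3120
hallucinatory_samples = 1134
hallucinatory_snippets = 1212

prevalence = hallucinatory_samples / total_samples
snippets_per_sample = hallucinatory_snippets / hallucinatory_samples

# Top-level shares as reported, in percent.
shares = {
    "Requirement Conflicting": 39.60,
    "Knowledge": 34.90,
    "Code Inconsistency": 25.50,
}

# ASSUMPTION: the shares are taken over the 1,212 snippets, not the 1,134 samples.
implied_counts = {
    name: round(hallucinatory_snippets * pct / 100) for name, pct in shares.items()
}

print(f"prevalence: {prevalence:.2%}")
print(f"snippets per hallucinatory sample: {snippets_per_sample:.2f}")
print(implied_counts, "total:", sum(implied_counts.values()))
```

Under that assumption the rounded counts add up to exactly 1,212, which is consistent with the shares being computed per snippet.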

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total generated samples	3,120	—	—	HumanEval + CoderEval	Section IV-A; Table I	—
Hallucinatory samples	1,134 (36.35%)	—	—	All datasets	Manual labeling: 1,134 samples contained hallucination	Section IV-A

What To Try In 7 Days

Add a lightweight hallucination checklist: validate requirement alignment, undefined vars, and library calls on every generated snippet.

Use retrieval augmentation (RAG) to inject project files or APIs before generation for repo-level tasks.

Filter LLM outputs by a simple heuristic: reject outputs that contradict explicit requirement strings or reference unknown APIs; run unit tests early.
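
As a starting point for the checklist and the "unknown API" filter above, here is a minimal sketch using only the Python standard library. The function name `checklist` and the `known_names` parameter are hypothetical, not part of the paper's tooling. The check is deliberately coarse: it ignores scoping and control flow, so it can miss problems and can raise false alarms (for example with `from x import *`). It does not verify requirement alignment, which still needs unit tests or human review.

```python
import ast
import builtins
import importlib.util


def _module_exists(root: str) -> bool:
    """True if a top-level module can be located in the current environment."""
    try:
        return importlib.util.find_spec(root) is not None
    except (ImportError, ValueError):
        return False


def checklist(source: str, known_names: frozenset = frozenset()) -> list[str]:
    """Return warnings for one generated Python snippet (hypothetical helper)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    warnings = []
    bound = set(dir(builtins)) | set(known_names)  # names treated as defined
    loaded = []  # every name that is read somewhere in the snippet

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                bound.add(alias.asname or root)
                if not _module_exists(root):
                    warnings.append(f"unknown module: {root}")
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                bound.add(alias.asname or alias.name)
            if node.level == 0 and node.module:
                root = node.module.split(".")[0]
                if not _module_exists(root):
                    warnings.append(f"unknown module: {root}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.append(node)
            else:
                bound.add(node.id)

    # Scope-insensitive: a name counts as defined if it is bound anywhere.
    for node in loaded:
        if node.id not in bound:
            warnings.append(f"possibly undefined name: {node.id} (line {node.lineno})")
    return warnings


snippet = "import not_a_real_module_xyz\n\ndef f(x):\n    return helper(x) + y\n"
for warning in checklist(snippet):
    print(warning)
```

For repo-level tasks, passing the project's own function and class names through `known_names` reduces false alarms; outputs that still produce warnings can be rejected or routed to review before unit tests run.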

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Lorien1128/code_hallucination

Data URLs

HumanEval (public), CoderEval (public)

Risks & Boundaries

Limitations

Study limited to Python and Java tasks from HumanEval and CoderEval.

Manual annotation is subjective despite strong inter-annotator agreement.

When Not To Use

As a sole quality metric for production release: hallucination detection complements unit tests and does not replace them.

For languages or domains outside Python/Java without validation — taxonomy distribution may differ.

Failure Modes

Prompt-enhancement (CoT or Self-Refine) can lower pass@1 on smaller models due to error propagation.

Subtle 'Inconsistent Libraries' hallucinations may pass unit tests locally but cause runtime failures in different environments.
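
One way to limit the first failure mode is to accept a refined answer only when it does not regress against the tests you already have. The sketch below is an illustration, not the paper's procedure: `refine` and `passes_tests` are hypothetical callables standing in for a model call and a test runner.

```python
from typing import Callable


def refine_with_guard(
    candidate: str,
    refine: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    max_rounds: int = 2,
) -> str:
    """Refine up to max_rounds times, never trading a passing candidate for a failing one."""
    best = candidate
    best_ok = passes_tests(best)
    for _ in range(max_rounds):
        revised = refine(best)
        if revised == best:
            break  # the refinement step has converged
        revised_ok = passes_tests(revised)
        if best_ok and not revised_ok:
            break  # the revision introduced a regression; keep the earlier version
        best, best_ok = revised, revised_ok
    return best


# Toy stand-ins: the "model" always rewrites to "broken", and only "good" passes.
print(refine_with_guard("good", lambda s: "broken", lambda s: s == "good"))
```

The guard only catches regressions that the available tests can see, so it does not help with hallucinations that pass tests locally, such as the library mismatches described above.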

Core Entities

Models

CodeLlama-7B, DeepSeek-Coder-1.3B, DeepSeek-Coder-7B, DeepSeek-R1 (671B), GPT-4 (gpt-4-0125-preview)

Metrics

pass@1, hallucination prevalence (%), hallucinatory-snippet count
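
For reference, pass@1 is the k = 1 case of pass@k, and the sketch below uses the standard unbiased estimator introduced alongside HumanEval. This summary does not say how many samples per task the study drew, so `n` and `c` in the example are illustrative values only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the fraction of correct samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level score is the mean of this per-task value over all tasks.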

Datasets

HumanEval, CoderEval

Benchmarks

HumanEval, CoderEval

Context Entities

Metrics

Cohen's Kappa (annotation agreement)
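
Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance. A minimal sketch follows; the label strings are made-up illustrative data, not annotations from the study.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six snippets from two annotators.
annotator_1 = ["req", "req", "know", "incons", "know", "req"]
annotator_2 = ["req", "know", "know", "incons", "know", "req"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.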

Datasets

HumanEval (164 Python tasks), CoderEval (230 Python + 230 Java functions)

Benchmarks

HumanEval, CoderEval
