Overview
The study uses manual labels and standard benchmarks; taxonomy and mitigation experiments are reproducible but limited to selected models and datasets.
Citations: 30
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/8
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.
Summary TLDR
The paper builds a practical taxonomy of "code hallucinations": explicit semantic conflicts between generated code and stated facts (requirements, surrounding code, or real-world knowledge). The authors manually analyze 3,120 LLM outputs from HumanEval and CoderEval and find 1,134 samples containing hallucinations (1,212 hallucinatory snippets in total). Requirement/Behavior conflicts are the most common type. The authors trace causes to model limits and prompt gaps, and quantify impacts (95% lead to incorrect functionality). They also show that prompt-based fixes (Self-Refine, Chain-of-Thought, and RAG) can reduce targeted hallucinations but can also reduce pass@1 for smaller models.
Problem Statement
LLM code generators often produce outputs that contradict explicit facts from requirements, project context, or real-world knowledge. Existing work focuses on generic code errors or natural-language hallucinations. It lacks a focused, evidence-backed taxonomy, a distributional analysis, a cause/impact study, and practical mitigation guidance for hallucinations specific to code generation.
Main Contribution
A reproducible, manually curated taxonomy of code hallucinations (3 primary categories, 12 leaf types).
A statistical study of 3,120 LLM-generated programs across HumanEval and CoderEval, reporting prevalence, model differences, and impact on correctness.
Key Findings
Code hallucinations are common: 1,134 of 3,120 samples (36.35%) contained at least one hallucination.
Three top-level hallucination shares: Requirement Conflicting 39.60%, Knowledge 34.90%, Code Inconsistency 25.50%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total generated samples | 3,120 | — | — | HumanEval + CoderEval | Section IV-A; Table I | — |
| Hallucinatory samples | 1,134 (36.35%) | — | — | All datasets | Manual labeling: 1,134 samples contained hallucination | Section IV-A |
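
The prevalence figure in the table can be re-derived from the raw counts reported above. The short check below is only an arithmetic sketch; the variable names are illustrative, and the snippets-per-sample ratio is a derived number that the source does not state directly.

```python
# Counts taken from the summary above.
total_samples = 3120          # generated programs (HumanEval + CoderEval)
hallucinatory_samples = 1134  # samples with at least one hallucination
hallucinatory_snippets = 1212 # individual hallucinatory snippets

# Share of samples containing a hallucination, as a percentage.
prevalence_pct = 100 * hallucinatory_samples / total_samples

# Derived ratio (not reported in the source): snippets per affected sample.
snippets_per_sample = hallucinatory_snippets / hallucinatory_samples

print(f"prevalence: {prevalence_pct:.2f}%")
print(f"snippets per hallucinatory sample: {snippets_per_sample:.2f}")
```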
What To Try In 7 Days
Add a lightweight hallucination checklist: validate requirement alignment, undefined variables, and library calls on every generated snippet.
Use retrieval augmentation (RAG) to inject project files or APIs before generation for repo-level tasks.
Filter LLM outputs by a simple heuristic: reject outputs that contradict explicit requirement strings or reference unknown APIs; run unit tests early.
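
One way to start on the checklist and filter ideas above is a static pass over each generated Python snippet before any tests run. The sketch below is an assumption about how such a check might look, not the paper's method: `undefined_names` and `should_reject` are hypothetical helper names, and the scoping logic is deliberately coarse (a name bound anywhere in the snippet counts as defined everywhere).

```python
import ast
import builtins


def undefined_names(source: str) -> set[str]:
    """Return names that are read but never bound anywhere in `source`.

    Coarse heuristic: scopes are ignored, so this under-reports rather
    than over-reports. Names bound only by `match` patterns are not tracked.
    """
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:  # Store or Del context binds (or refers to a bound) name
                defined.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a as x` binds `x`.
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)
    return used - defined


def should_reject(source: str) -> bool:
    """Reject snippets that do not parse or that read undefined names."""
    try:
        return bool(undefined_names(source))
    except SyntaxError:
        return True


snippet = "def area(r):\n    return pi * r * r\n"
print(undefined_names(snippet))  # `pi` is never imported or assigned
print(should_reject(snippet))
```

A check like this only catches one narrow class of problem; requirement alignment and behavioral conflicts still need unit tests or review.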
Risks & Boundaries
Limitations
Study limited to Python and Java tasks from HumanEval and CoderEval.
Manual annotation is subjective despite strong inter-annotator agreement.
When Not To Use
As a sole quality metric for production release: hallucination detection complements unit tests rather than replacing them.
For languages or domains outside Python/Java without validation: the taxonomy distribution may differ.
Failure Modes
Prompt-enhancement (CoT or Self-Refine) can lower pass@1 on smaller models due to error propagation.
Subtle 'Inconsistent Libraries' hallucinations may pass unit tests locally but cause runtime failures in different environments.
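
For the library-related failure mode, one possible guard is to check whether every module a generated snippet imports can actually be resolved in the target environment. This is a minimal sketch under that assumption; `missing_modules` is a hypothetical helper, it only inspects top-level absolute imports, and it says nothing about version mismatches or missing attributes inside a module that does exist.

```python
import ast
import importlib.util


def missing_modules(source: str) -> set[str]:
    """Return top-level imported module names that cannot be found
    in the current Python environment."""
    tree = ast.parse(source)
    top_level = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            top_level.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Relative imports (level > 0) are skipped: they depend on the package layout.
            top_level.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module cannot be located.
    return {name for name in top_level if importlib.util.find_spec(name) is None}


generated = "import json\nimport definitely_not_a_real_module_xyz\n"
print(missing_modules(generated))
```

Running the same check inside each deployment image, rather than only on the developer machine, is what surfaces the environment-specific failures described above.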

