Embed executable code in prompts to ground LLM reasoning and cut hallucinations

Overview

Decision Snapshot: Needs Validation

The paper shows large empirical HIT@K gains on several public benchmarks, which suggests practical value for QA apps; however, increased inference cost and dependence on structured knowledge limit immediate production rollout.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang

Links

Abstract / PDF

Why It Matters For Business

KDCM can cut prompt-induced factual errors and boost answer accuracy and verifiability on QA and reasoning tasks, but it raises inference cost and needs structured knowledge.

Who Should Care

CTO, ML Engineer, Data Scientist, Product Manager, Engineering Lead

Summary TLDR

This paper introduces KDCM, a chain-style knowledge-distillation method that embeds small executable code modules in reasoning prompts to guide knowledge-graph traversal. The code module constrains intermediate steps and supplies structured facts during inference. Experiments with GPT-4 and LLaMA 3.3 on WebQSP, CWQ, GSM8K, MWP, and Dr. SPIDER report large HIT@K gains (average HIT@1 98.4%, HIT@3 96.8%, HIT@5 95.5%). The method improves robustness to ambiguous prompts but raises inference complexity and requires access to structured knowledge.

Problem Statement

LLMs often produce fluent but false answers when prompts are ambiguous. Existing fixes (retrieval, verification) add cost or don't control internal multi-step reasoning. The paper aims to reduce prompt-induced hallucinations by explicitly constraining intermediate reasoning with executable code that explores knowledge graphs.

Main Contribution

Design of KDCM: a knowledge-distillation chain that embeds executable code in prompts to guide knowledge-graph exploration.

A prompt workflow that produces validated intermediate steps and a final grounded answer.
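The workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph `TOY_KG`, the `lookup` module, and the helpers `run_chain` and `build_prompt` are all hypothetical names, and the paper's actual code module is not reproduced in this summary. The sketch runs a small code module hop by hop over a knowledge graph, records only the steps the graph confirms, and places both the code and the verified steps in the prompt.

```python
import textwrap

# Toy knowledge graph with illustrative facts: (subject, relation) -> object.
TOY_KG = {
    ("Marie Curie", "born_in"): "Warsaw",
    ("Warsaw", "capital_of"): "Poland",
}

# Hypothetical code module; the paper's real module is not shown in this summary.
CODE_MODULE = textwrap.dedent('''
    def lookup(kg, subject, relation):
        """Return the object for (subject, relation), or None if absent."""
        return kg.get((subject, relation))
''').strip()


def run_chain(kg, start, relations):
    """Execute the module hop by hop; stop at the first missing edge."""
    namespace = {}
    exec(CODE_MODULE, namespace)  # runs only the trusted snippet defined above
    lookup = namespace["lookup"]
    steps, current = [], start
    for relation in relations:
        obj = lookup(kg, current, relation)
        if obj is None:
            return steps, False  # chain is not fully grounded
        steps.append(f"{current} --{relation}--> {obj}")
        current = obj
    return steps, True


def build_prompt(question, steps, grounded):
    """Assemble a prompt that carries the code and its verified outputs."""
    status = "complete" if grounded else "incomplete; answer 'unknown' if needed"
    return "\n".join([
        "Answer using only the verified steps below.",
        "Code module:",
        CODE_MODULE,
        f"Verified steps ({status}):",
        *steps,
        f"Question: {question}",
    ])


steps, grounded = run_chain(TOY_KG, "Marie Curie", ["born_in", "capital_of"])
prompt = build_prompt("In which country was Marie Curie born?", steps, grounded)
print(prompt)
```

Stopping at the first missing edge, rather than letting the model fill the gap, is the design choice that keeps every intermediate step tied to a fact in the graph.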

Key Findings

Adding the code-guided module raised WebQSP HIT@1 from 82.36% to 99.33%

Numbers: WebQSP HIT@1: 82.36% → 99.33% (+16.97 pp)

Practical Use: Embed a small executable step to ground answers; this can cut many prompt-induced errors on knowledge-QA tasks.

Evidence Ref: Table 1

Average performance across tested datasets reached HIT@1 98.4%, HIT@3 96.83%, HIT@5 95.51%

Numbers: Average (Ours) HIT@1=98.4%, HIT@3=96.83%, HIT@5=95.51%

Practical Use: On evaluated benchmarks, code-guided reasoning substantially improves top-k retrieval accuracy versus RAG and other baselines.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HIT@1 (average across datasets)	98.4%	KG-LLM-PR 91.06%	+7.34 pp	Table 2 average	Table 2 shows Average (Ours) 98.4% vs KG-LLM-PR 91.06%	Table 2
WebQSP HIT@1	99.33%	KDCM (no code) 82.36%	+16.97 pp	WebQSP	Table 1 WebQSP KDCM vs KDCM + Code Module	Table 1
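The Delta column is an absolute difference in percentage points (pp), not a relative percentage change. A small check of the two rows, using only the values reported in the table (the helper name `delta_pp` is an assumption for illustration):

```python
def delta_pp(value, baseline):
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


avg_delta = delta_pp(98.4, 91.06)      # HIT@1 average vs KG-LLM-PR
webqsp_delta = delta_pp(99.33, 82.36)  # WebQSP HIT@1 vs KDCM without code

print(avg_delta)     # 7.34
print(webqsp_delta)  # 16.97
```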

What To Try In 7 Days

Prototype embedding a small executable code block that queries a domain KG into your prompt.

Run A/B tests on a small QA set and measure HIT@1/3/5 vs your current prompt strategy.

Compare against your existing retrieval pipeline to check accuracy vs latency trade-offs.
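For the A/B step, a HIT@K scorer can be as small as the sketch below. It assumes the common definition (a question counts as a hit if any of the top K predictions matches a gold answer); the paper's exact definition is not given in this summary, and the sample questions, answers, and the name `hit_at_k` are illustrative only.

```python
def hit_at_k(ranked_answers, gold_answers, k):
    """Fraction of questions whose top-k predictions contain a gold answer."""
    hits = 0
    for ranked, gold in zip(ranked_answers, gold_answers):
        if any(pred in gold for pred in ranked[:k]):
            hits += 1
    return hits / len(gold_answers)


# Illustrative data: one gold-answer set and one ranked list per question.
gold = [{"Poland"}, {"Paris"}, {"4"}]
current_prompt = [["Russia", "Poland"], ["Paris"], ["5", "6", "4"]]
code_guided = [["Poland"], ["Paris"], ["4", "5"]]

for k in (1, 3, 5):
    a = hit_at_k(current_prompt, gold, k)
    b = hit_at_k(code_guided, gold, k)
    print(f"HIT@{k}: current={a:.2f} code-guided={b:.2f}")
```

Under this definition HIT@K cannot decrease as K grows, so if your own numbers fall from HIT@1 to HIT@5, check which definition your evaluation script uses before comparing against published figures.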

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Higher inference complexity and additional runtime cost versus plain prompting.

Depends on access to suitable structured knowledge graphs and code representations.

When Not To Use

When you lack reliable structured knowledge for the domain.

When strict latency or compute budgets prevent extra inference steps.

Failure Modes

Incomplete or wrong knowledge graphs can still lead to incorrect conclusions.

Bugs or mis-specified code modules can constrain reasoning incorrectly.
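One inexpensive guard against both failure modes is to audit the graph against a handful of facts you already trust before wiring it into a prompt. The sketch below is an assumption-laden illustration, not something the paper describes: `audit_kg`, the sample graph, and the gold triples are hypothetical.

```python
def audit_kg(kg, gold_triples):
    """Return gold triples that the graph is missing or contradicts."""
    problems = []
    for subject, relation, expected in gold_triples:
        actual = kg.get((subject, relation))
        if actual != expected:
            problems.append((subject, relation, expected, actual))
    return problems


# Illustrative graph with one wrong edge and one missing edge.
kg = {
    ("Warsaw", "capital_of"): "Poland",
    ("Paris", "capital_of"): "Germany",
}
gold = [
    ("Warsaw", "capital_of", "Poland"),
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
]

problems = audit_kg(kg, gold)
for subject, relation, expected, actual in problems:
    print(f"{subject} {relation}: expected {expected!r}, graph has {actual!r}")
```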

Core Entities

Models

GPT-4, LLaMA 3.3

Metrics

HIT@1, HIT@3, HIT@5

Datasets

WebQuestionsSP, CWQ, GSM8K, MWP, Dr. SPIDER

Benchmarks

HIT@1, HIT@3, HIT@5

