Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
KDCM can cut prompt-induced factual errors and boost answer accuracy and verifiability on QA and reasoning tasks, but it raises inference cost and needs structured knowledge.
Summary TLDR
This paper introduces KDCM, a chain-style knowledge-distillation method that embeds small executable code modules in reasoning prompts to guide knowledge-graph traversal. The code module constrains intermediate steps and supplies structured facts during inference. Experiments with GPT-4 and LLaMA 3.3 on WebQSP, CWQ, GSM8K, MWP, and Dr. SPIDER report large HIT@K gains (average HIT@1 98.4%, HIT@3 96.8%, HIT@5 95.5%). The method improves robustness to ambiguous prompts but raises inference complexity and requires access to structured knowledge.
Problem Statement
LLMs often produce fluent but false answers when prompts are ambiguous. Existing fixes (retrieval, verification) add cost or don't control internal multi-step reasoning. The paper aims to reduce prompt-induced hallucinations by explicitly constraining intermediate reasoning with executable code that explores knowledge graphs.
Main Contribution
- Design of KDCM: a knowledge-distillation chain that embeds executable code in prompts to guide knowledge-graph exploration.
- A prompt workflow that produces validated intermediate steps and a final grounded answer.
- Empirical evaluation on public QA and reasoning datasets using GPT-4 and LLaMA 3.3, showing large HIT@K gains versus baselines.
- Analysis of robustness to varied prompt formulations, and discussion of deployment costs and knowledge requirements.
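To make the idea of a code module embedded in a prompt concrete, here is a minimal sketch. It is not the paper's implementation: the toy graph, the `traverse` function, the `build_prompt` helper, and the prompt layout are all hypothetical names and choices made for illustration. The module source is kept as a string so the same text can be both executed and pasted into the prompt.

```python
# Hypothetical toy knowledge graph: (subject, relation) -> list of objects.
TOY_KG = {
    ("Paris", "capital_of"): ["France"],
    ("France", "currency"): ["Euro"],
}

# The code module is stored as text so it can be run AND shown to the model.
MODULE_SOURCE = '''
def traverse(kg, start, relations):
    # Follow a fixed relation path from `start`, recording every hop taken.
    frontier, steps = [start], []
    for rel in relations:
        next_frontier = []
        for node in frontier:
            for obj in kg.get((node, rel), []):
                steps.append((node, rel, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return frontier, steps
'''

# Execute the module text to obtain a callable `traverse`.
_namespace = {}
exec(MODULE_SOURCE, _namespace)
traverse = _namespace["traverse"]


def build_prompt(question, kg, start, relations):
    """Assemble a prompt containing the module, its recorded hops, and candidates."""
    answers, steps = traverse(kg, start, relations)
    facts = "\n".join(f"{s} --{r}--> {o}" for s, r, o in steps)
    return (
        f"Question: {question}\n"
        f"Code module:{MODULE_SOURCE}\n"
        f"Validated facts:\n{facts}\n"
        f"Candidate answers: {answers}\n"
        "Answer using only the validated facts."
    )


prompt = build_prompt(
    "What currency is used in the country whose capital is Paris?",
    TOY_KG, "Paris", ["capital_of", "currency"],
)
```

In this sketch the intermediate hops are computed outside the model and passed in as facts, which is one simple way to constrain the reasoning chain. A real deployment would replace the dictionary with queries against a domain knowledge graph.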
Key Findings
- Adding the code-guided module raised WebQSP HIT@1 from 82.36% to 99.33%.
- Average performance across tested datasets reached HIT@1 98.4%, HIT@3 96.83%, HIT@5 95.51%.
- The method remains robust under prompt variation but increases inference complexity.
Results
[Chart: HIT@1 averaged across datasets, on WebQSP, on GSM8K, and for generalization]
Who Should Care
What To Try In 7 Days
- Prototype embedding a small executable code block that queries a domain KG into your prompt.
- Run A/B tests on a small QA set and measure HIT@1/3/5 vs your current prompt strategy.
- Compare against your existing retrieval pipeline to check accuracy vs latency trade-offs.
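For the A/B measurement above, a small scoring helper is enough. The sketch below assumes the common definition of HIT@K: the fraction of questions for which at least one gold answer appears among the top K ranked predictions. This summary does not state the paper's exact definition, so treat that as an assumption. The function names and sample data are hypothetical.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of questions with a gold answer among the top-k predictions."""
    if not gold_answers:
        return 0.0
    hits = 0
    for preds, gold in zip(ranked_predictions, gold_answers):
        if any(p in gold for p in preds[:k]):
            hits += 1
    return hits / len(gold_answers)


def compare(baseline, candidate, gold, ks=(1, 3, 5)):
    """Return {k: (baseline HIT@k, candidate HIT@k)} for an A/B comparison."""
    return {k: (hit_at_k(baseline, gold, k), hit_at_k(candidate, gold, k)) for k in ks}


# Hypothetical two-question evaluation set.
gold = [{"Euro"}, {"Paris"}]
baseline = [["Dollar", "Euro"], ["Paris"]]   # current prompt strategy
candidate = [["Euro"], ["Paris"]]            # code-guided prompt

report = compare(baseline, candidate, gold)
```

Under this definition HIT@K cannot decrease as K grows. The averages reported here fall from HIT@1 to HIT@5, so the paper may use a different variant, and it is worth confirming the definition before comparing your numbers with its results.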
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Higher inference complexity and additional runtime cost versus plain prompting.
- Depends on access to suitable structured knowledge graphs and code representations.
- Validated mainly on text QA/reasoning datasets; open-ended and multimodal behavior is untested.
When Not To Use
- When you lack reliable structured knowledge for the domain.
- When strict latency or compute budgets prevent extra inference steps.
- For open-ended creative text generation where grounding is unnecessary.
Failure Modes
- Incomplete or wrong knowledge graphs can still lead to incorrect conclusions.
- Bugs or mis-specified code modules can constrain reasoning incorrectly.
- Over-reliance on external structure may break performance in domains without adequate graphs.
Core Entities
Models
- GPT-4
- LLaMA 3.3
Metrics
- HIT@1
- HIT@3
- HIT@5
Datasets
- WebQuestionsSP (WebQSP)
- ComplexWebQuestions (CWQ)
- GSM8K
- MWP
- Dr. SPIDER

