Overview
Practical toolkit: dataset, pipeline, and automatic metrics let teams prototype KG-based attribution quickly; results are meaningful but limited to biographies, simple triple KGs, and automatic evaluators.
Citations: 22
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 45%
Novelty: 60%
Why It Matters For Business
Attributing LLM outputs to structured KGs and marking missing facts ([NA]) makes generated content more verifiable and helps reduce risk in finance, law, and healthcare where factual traceability matters.
Summary TLDR
This paper defines KaLMA, a task and benchmark for attributing LLM answers to structured knowledge graphs (KGs). It releases BioKaLMA (1,085 biography QA items, each with a per-question minimum KG), a baseline retrieval→re-rank→generate pipeline, and automatic evaluation that scores text quality (G-Eval), citation correctness, precision, and recall, and text–citation alignment (NLI). Experiments show that GPT-4 leads, but no model exceeds ~40 micro F1 on citation quality; that retrieval accuracy strongly controls citation recall; and that a new 'Conscious Incompetence' mark ([NA]) helps flag missing KG facts but has limited recall (~15%).
Problem Statement
LLMs hallucinate facts. Prior attribution benchmarks use documents, ignore structured KGs, and assume the retrieval source fully covers needed facts. There is no reference-free, automatic way to score KG-based citations or to let models signal when required facts are missing.
Main Contribution
Define KaLMA: attribute LLM outputs to knowledge graphs and allow sentences to cite triples or mark missing knowledge ([NA])
Introduce 'Conscious Incompetence' setting so models can mark claims needing support not present in the KG
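To make the two contributions above concrete, here is a minimal sketch of how a single attributed sentence might be represented. The inline marker format (`[k]` for the k-th retrieved triple, `[NA]` for a claim whose supporting fact is absent from the KG), the `AttributedSentence` class, and `parse_sentence` are all illustrative assumptions, not the paper's actual output format or code.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed (hypothetical) inline format: "[3]" cites the third retrieved triple,
# "[NA]" flags a claim that needs support the KG does not contain.
MARKER = re.compile(r"\[(NA|\d+)\]")


@dataclass
class AttributedSentence:
    text: str                                            # sentence with markers stripped
    citations: List[int] = field(default_factory=list)   # indices of cited triples
    needs_missing_fact: bool = False                     # True if the sentence carries [NA]


def parse_sentence(raw: str) -> AttributedSentence:
    """Split a generated sentence into plain text, cited triple indices, and an [NA] flag."""
    markers = MARKER.findall(raw)
    citations = [int(m) for m in markers if m != "NA"]
    text = MARKER.sub("", raw)
    text = re.sub(r"\s+([.,;])", r"\1", text)  # drop the space left before punctuation
    text = " ".join(text.split())              # collapse doubled spaces
    return AttributedSentence(text, citations, "NA" in markers)


sentence = parse_sentence("She was born in Vienna [1][3] and studied law [NA].")
print(sentence)
```

Keeping the `[NA]` flag separate from the citation list lets an audit distinguish "supported by the KG" from "the model knows it lacks support" without re-parsing the text.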
Key Findings
Benchmark size and scope
Citation quality ceiling across models
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best micro F1 (citation) | 39.4 (GPT-4, specific questions) | — | — | BioKaLMA specific | Table 3; §5.1 | Table 3 |
| Micro citation correctness | 97.6 (GPT-4) | ≈95.5 (gold overall) | — | BioKaLMA specific | Table 3; Table 5 | Table 3 |
What To Try In 7 Days
Run the retrieval→re-rank→generate pipeline on a small domain KG and inspect citations.
Measure citation precision/recall using NLI alignment to find missed facts.
Add [NA] marking to outputs and audit whether flagged claims map to missing KG facts.
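As a starting point for the measurement step above, the sketch below computes micro-averaged citation precision, recall, and F1 by comparing cited triples against a per-question minimum KG. It deliberately swaps in exact set matching for the NLI alignment mentioned in the list, so it is a simplification and not the paper's scoring code. The function names and the placeholder triples are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def citation_counts(cited: Iterable[Triple], gold: Iterable[Triple]) -> Tuple[int, int, int]:
    """Return (hits, number cited, number gold) for one question, ignoring duplicates."""
    cited_set, gold_set = set(cited), set(gold)
    return len(cited_set & gold_set), len(cited_set), len(gold_set)


def micro_scores(examples: List[Tuple[List[Triple], List[Triple]]]) -> Dict[str, float]:
    """Pool counts over all questions first, then compute precision, recall, and F1."""
    hits = n_cited = n_gold = 0
    for cited, gold in examples:
        h, c, g = citation_counts(cited, gold)
        hits, n_cited, n_gold = hits + h, n_cited + c, n_gold + g
    precision = hits / n_cited if n_cited else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder data: (triples the model cited, minimum KG triples for that question).
examples = [
    ([("A", "born_in", "X"), ("A", "spouse", "B")],
     [("A", "born_in", "X"), ("A", "occupation", "poet"),
      ("A", "spouse", "B"), ("A", "award", "Z")]),
    ([("C", "born_in", "Y"), ("C", "occupation", "chemist")],
     [("C", "born_in", "Y")]),
]
print(micro_scores(examples))
```

Pooling counts before dividing means questions with larger minimum KGs weigh more. Averaging per-question scores instead would give a macro variant.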
Risks & Boundaries
Limitations
Only simple triple-based KGs where nodes are entities; other KG formats not studied
Text quality scoring uses text-davinci-003 (G-Eval), which may bias evaluations toward certain model styles
When Not To Use
When your knowledge source is not a triple-based KG (e.g., long documents as KG nodes)
When human-verified ground-truth answers are required for evaluation
Failure Modes
High correctness but low recall: models omit required KG facts
Poor retrieval yields large drops in recall even when correctness stays high

