KaLMA + BioKaLMA: benchmark and metrics to attribute LLM outputs to knowledge graphs

October 9, 2023 | 7 min

Overview

Decision Snapshot: Needs Validation

Practical toolkit: dataset, pipeline, and automatic metrics let teams prototype KG-based attribution quickly; results are meaningful but limited to biographies, simple triple KGs, and automatic evaluators.

Citations: 22

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 45%

Novelty: 60%

Authors

Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, Aixin Sun


Why It Matters For Business

Attributing LLM outputs to structured KGs and marking missing facts ([NA]) makes generated content more verifiable and helps reduce risk in finance, law, and healthcare where factual traceability matters.

Who Should Care

Summary TLDR

This paper defines KaLMA, a task and benchmark for attributing LLM answers to structured knowledge graphs (KGs). It releases BioKaLMA (1,085 biography QA items, each paired with its own minimum KG), a baseline retrieval→re-rank→generate pipeline, and an automatic evaluation that scores text quality (G-Eval), citation correctness, precision, and recall, and text–citation alignment (NLI). In the experiments, GPT-4 leads, but no model exceeds ~40 micro F1 on citation quality. Retrieval accuracy strongly controls recall, and a new 'Conscious Incompetence' mark ([NA]) helps flag missing KG facts but has limited recall (~15%).

Problem Statement

LLMs hallucinate facts. Prior attribution benchmarks use documents, ignore structured KGs, and assume the retrieval source fully covers needed facts. There is no reference-free, automatic way to score KG-based citations or to let models signal when required facts are missing.

Main Contribution

Define KaLMA: attribute LLM outputs to knowledge graphs and allow sentences to cite triples or mark missing knowledge ([NA])

Introduce a 'Conscious Incompetence' setting so models can mark claims that need support not present in the KG
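To make the output format concrete, here is a minimal sketch of how a generated sentence with citation marks could be parsed. The marker syntax is an assumption for illustration (`[3]` cites the third retrieved triple, `[NA]` flags a claim whose supporting fact is absent from the KG); the names `AttributedSentence` and `parse_sentence` are hypothetical and do not come from the paper's code.

```python
import re
from dataclasses import dataclass, field

# Assumed marker syntax (not confirmed by the paper summary):
# "[3]" cites retrieved triple number 3, "[NA]" marks missing knowledge.
MARKER = re.compile(r"\[(NA|\d+)\]")


@dataclass
class AttributedSentence:
    text: str                                            # sentence with markers removed
    citations: list[int] = field(default_factory=list)   # indices of cited triples
    needs_missing_fact: bool = False                     # True if the sentence carries [NA]


def parse_sentence(raw: str) -> AttributedSentence:
    """Split one generated sentence into plain text, citations, and an [NA] flag."""
    marks = MARKER.findall(raw)
    text = MARKER.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()        # collapse gaps left by removed markers
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)    # drop the space before punctuation
    return AttributedSentence(
        text=text,
        citations=[int(m) for m in marks if m != "NA"],
        needs_missing_fact="NA" in marks,
    )
```

For example, `parse_sentence("Ada Lovelace was a mathematician [1][2] born in London [NA].")` yields citations `[1, 2]` and `needs_missing_fact=True`, which is enough structure to feed a citation scorer or an [NA] audit.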

Key Findings

Benchmark size and scope

Numbers: 1,085 entries; avg 6.8 KG facts per question

Practical Use: Use BioKaLMA for prototyping KG-based attribution workflows at small-to-medium scale.

Evidence Ref: §2.3

Citation quality ceiling across models

Numbers: Best micro F1 = 39.4 (GPT-4 on specific questions)

Practical Use: Expect current LLMs to require further engineering to produce high-coverage, accurate citations from KGs.

Evidence Ref: Table 3, §5.1
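For readers unfamiliar with micro-averaging, the sketch below shows one common way to compute micro precision, recall, and F1 over citations: pool the counts across all answers, then compute the ratios once. It assumes precision is the share of cited triples that belong to the gold (minimum) set and recall is the share of gold triples that were cited; the paper's exact definitions may differ, and `micro_prf` is a hypothetical helper name.

```python
def micro_prf(answers):
    """Micro-averaged citation precision, recall, and F1.

    `answers` is an iterable of (cited, gold) pairs, where each element is a
    set of triple identifiers. Counts are pooled over all answers before the
    ratios are taken, so long answers weigh more than short ones.
    """
    hits = cited_total = gold_total = 0
    for cited, gold in answers:
        hits += len(cited & gold)      # cited triples that are in the gold set
        cited_total += len(cited)
        gold_total += len(gold)

    precision = hits / cited_total if cited_total else 0.0
    recall = hits / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

With two answers, `({"a", "b"}, {"a", "b", "c", "d"})` and `({"e", "x"}, {"e"})`, the pooled counts are 3 hits, 4 cited, and 5 gold, giving precision 0.75, recall 0.60, and F1 of about 0.67.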

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Best micro F1 (citation) | 39.4 (GPT-4, specific questions) | - | - | BioKaLMA specific | Table 3; §5.1 | Table 3
Micro citation correctness | 97.6 (GPT-4) | ≈95.5 (gold overall) | - | BioKaLMA specific | Table 3; Table 5 | Table 3

What To Try In 7 Days

Run the retrieval→re-rank→generate pipeline on a small domain KG and inspect citations.

Measure citation precision/recall using NLI alignment to find missed facts.

Add [NA] marking to outputs and audit whether flagged claims map to missing KG facts.
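The first step above can be prototyped with a toy retrieve, re-rank, and generate loop like the one below. This is a sketch under stated assumptions, not the paper's pipeline: retrieval here is a naive subject-name match, re-ranking is token overlap, the prompt wording is invented, and `llm` is any callable you supply that maps a prompt string to a completion. All function names are hypothetical.

```python
import re
from typing import Callable

Triple = tuple[str, str, str]  # (subject, relation, object)


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, kg: list[Triple]) -> list[Triple]:
    """Keep triples whose subject name appears verbatim in the question."""
    q = question.lower()
    return [t for t in kg if t[0].lower() in q]


def rerank(question: str, triples: list[Triple], top_k: int = 3) -> list[Triple]:
    """Order triples by token overlap between the question and relation + object."""
    q = tokens(question)
    ranked = sorted(triples, key=lambda t: len(q & tokens(t[1] + " " + t[2])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, triples: list[Triple]) -> str:
    """Number the evidence so the model can cite it as [n] or answer with [NA]."""
    facts = "\n".join(f"[{i}] {s} | {r} | {o}" for i, (s, r, o) in enumerate(triples, 1))
    return (
        f"Knowledge:\n{facts}\n"
        f"Question: {question}\n"
        "Cite supporting facts as [n]; write [NA] where a needed fact is missing."
    )


def answer(question: str, kg: list[Triple], llm: Callable[[str], str], top_k: int = 3):
    """Run retrieve -> re-rank -> generate and return the output with its evidence."""
    evidence = rerank(question, retrieve(question, kg), top_k)
    return llm(build_prompt(question, evidence)), evidence
```

Passing `llm=lambda prompt: prompt` is a quick way to inspect what the generator would see before wiring in a real model. Returning the evidence list alongside the output keeps the citation indices resolvable during the audit steps.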

Agent Features

Tool Use
Uses KG retrieval + LLM generation


Risks & Boundaries

Limitations

Only simple triple-based KGs where nodes are entities; other KG formats not studied

Text quality scoring uses text-davinci-003 (G-Eval), which may bias evaluations toward certain model styles

When Not To Use

When your knowledge source is not a triple-based KG (e.g., long documents as KG nodes)

When human-verified ground-truth answers are required for evaluation

Failure Modes

High correctness but low recall: models omit required KG facts

Poor retrieval yields large drops in recall even when correctness stays high
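A small worked example shows why these two failure modes go together. The sketch assumes that correctness means "each cited triple exists in the KG" and recall means "share of gold facts that were cited"; these are simplified readings of the metrics, and both helper names are hypothetical.

```python
def recall_ceiling(retrieved: set, gold: set) -> float:
    """Best recall reachable if the generator may only cite retrieved triples."""
    if not gold:
        raise ValueError("gold must contain at least one fact")
    return len(retrieved & gold) / len(gold)


def correctness_and_recall(cited: set, kg: set, gold: set) -> tuple[float, float]:
    """Correctness: cited triples found in the KG. Recall: gold facts that were cited."""
    if not gold:
        raise ValueError("gold must contain at least one fact")
    correctness = len(cited & kg) / len(cited) if cited else 0.0
    recall = len(cited & gold) / len(gold)
    return correctness, recall


# Five gold facts, but retrieval surfaces only two of them (plus one distractor).
kg = {"f1", "f2", "f3", "f4", "f5", "f6"}
gold = {"f1", "f2", "f3", "f4", "f5"}
retrieved = {"f1", "f2", "f6"}
cited = {"f1", "f2"}  # the model cites exactly the useful retrieved facts

ceiling = recall_ceiling(retrieved, gold)                     # 2 / 5
correctness, recall = correctness_and_recall(cited, kg, gold)  # 2 / 2 and 2 / 5
```

Here every citation is correct (1.0), yet recall is stuck at 0.4 because three gold facts never reached the generator. No prompt change can lift recall above the retrieval ceiling.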

Core Entities

Models

GPT-4 (gpt-4-0314), ChatGPT (gpt-3.5-turbo-0301), LLaMA-7B, LLaMA-13B, Alpaca-7B, Vicuna-13B

Metrics

G-Eval (text quality: coherence/consistency/fluency/relevance); Citation correctness / precision / recall / F1; Alignment via TRUE NLI

Datasets

BioKaLMA, WikiData, Biographical database (Plum et al.)

Benchmarks

KaLMA (this work)