KaLMA + BioKaLMA: benchmark and metrics to attribute LLM outputs to knowledge graphs

Overview

Decision Snapshot: Needs Validation

Practical toolkit: dataset, pipeline, and automatic metrics let teams prototype KG-based attribution quickly; results are meaningful but limited to biographies, simple triple KGs, and automatic evaluators.

Citations: 22

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 45%

Novelty: 60%

Authors

Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, Aixin Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Attributing LLM outputs to structured KGs and marking missing facts ([NA]) makes generated content more verifiable and helps reduce risk in finance, law, and healthcare where factual traceability matters.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper defines KaLMA, a task and benchmark for attributing LLM answers to structured knowledge graphs (KGs). It releases BioKaLMA (1,085 biography QA items with per-question minimum KG), a baseline retrieval→rerank→generate pipeline, and automatic evaluation that scores text quality (G-Eval), citation correctness/precision/recall, and text–citation alignment (NLI). Experiments show GPT-4 leads but no model exceeds ~40 micro F1 on citation quality; retrieval accuracy strongly controls recall; and a new 'Conscious Incompetence' mark ([NA]) helps flag missing KG facts but has limited recall (~15%).

Problem Statement

LLMs hallucinate facts. Prior attribution benchmarks use documents, ignore structured KGs, and assume the retrieval source fully covers needed facts. There is no reference-free, automatic way to score KG-based citations or to let models signal when required facts are missing.

Main Contribution

Define KaLMA: attribute LLM outputs to knowledge graphs and allow sentences to cite triples or mark missing knowledge ([NA])

Introduce 'Conscious Incompetence' setting so models can mark claims needing support not present in the KG
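As an illustration of the output format these two contributions imply, the sketch below parses one generated sentence into its plain text, the triple indices it cites, and a flag for the [NA] mark. The bracketed-index syntax such as [1], the CitedSentence type, and the parse_sentence helper are assumptions made for this example; only the [NA] mark itself comes from the summary above.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed marker syntax: [1], [2], ... cite numbered KG triples; [NA] marks
# a claim whose supporting fact is missing from the KG.
MARKER = re.compile(r"\[(\d+|NA)\]")


@dataclass
class CitedSentence:
    text: str
    triple_ids: List[int] = field(default_factory=list)
    needs_missing_knowledge: bool = False


def parse_sentence(sentence: str) -> CitedSentence:
    """Split a generated sentence into text, cited triple ids, and an [NA] flag."""
    marks = MARKER.findall(sentence)
    text = MARKER.sub("", sentence)
    text = re.sub(r"\s+", " ", text).strip()      # collapse leftover whitespace
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return CitedSentence(
        text=text,
        triple_ids=[int(m) for m in marks if m != "NA"],
        needs_missing_knowledge="NA" in marks,
    )


# Hypothetical sentences, for illustration only.
cited = parse_sentence("Ada was born in London [1][2].")
flagged = parse_sentence("She later moved abroad [NA].")
print(cited)
print(flagged)
```

Keeping the [NA] flag separate from the triple indices makes it straightforward to audit flagged claims on their own.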

Key Findings

Benchmark size and scope

Numbers: 1,085 entries; avg 6.8 KG facts per question

Practical Use: Use BioKaLMA for prototyping KG-based attribution workflows at small-to-medium scale.

Evidence Ref: §2.3

Citation quality ceiling across models

Numbers: Best micro F1 = 39.4 (GPT-4 on specific questions)

Practical Use: Expect current LLMs to require further engineering to produce high-coverage, accurate citations from KGs.

Evidence Ref: Table 3, §5.1
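To make the micro F1 figure concrete, here is a minimal sketch of micro-averaged citation precision, recall, and F1. It assumes a simplified reading of the metrics: a cited triple counts as a hit when it belongs to the question's minimum KG, and counts are pooled over all questions before dividing. The micro_prf function and the toy triple ids are hypothetical, scores are on a 0 to 1 scale rather than the 0 to 100 scale used in the reported numbers, and the paper's own scorer may differ in detail.

```python
from typing import List, Set, Tuple


def micro_prf(predicted: List[Set[int]], gold: List[Set[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-question sets of triple ids."""
    hits = cited = needed = 0
    for pred, ref in zip(predicted, gold):
        hits += len(pred & ref)   # cited triples that are in the minimum KG
        cited += len(pred)        # all triples the model cited
        needed += len(ref)        # all triples the question requires
    precision = hits / cited if cited else 0.0
    recall = hits / needed if needed else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: every citation is relevant, but half of the needed facts are never cited.
predicted = [{1, 2}, {3}]
gold = [{1, 2, 4, 5}, {3, 6}]
precision, recall, f1 = micro_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The toy data shows how a model can cite only relevant facts and still score a modest F1 because it omits required ones.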

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Best micro F1 (citation)	39.4 (GPT-4, specific questions)	—	—	BioKaLMA specific	Table 3; §5.1	Table 3
Micro citation correctness	97.6 (GPT-4)	≈95.5 (gold overall)	—	BioKaLMA specific	Table 3; Table 5	Table 3

What To Try In 7 Days

Run the retrieval→re-rank→generate pipeline on a small domain KG and inspect citations.

Measure citation precision and recall against the KG facts each question needs, and use NLI alignment to check that each sentence is supported by the triples it cites.

Add [NA] marking to outputs and audit whether flagged claims map to missing KG facts.
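The first step above can be prototyped without any model access. The sketch below is a toy stand-in for a retrieval→re-rank→generate pipeline, not the authors' code: retrieval is a naive subject-name match, re-ranking is token overlap with the question, and the "generate" step only builds the prompt that would be sent to an LLM. The function names, the prompt wording, and the three-triple KG are all assumptions made for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def retrieve(kg: List[Triple], question: str) -> List[Triple]:
    """Naive retrieval: keep triples whose subject name appears in the question."""
    q = question.lower()
    return [t for t in kg if t[0].lower() in q]


def rerank(triples: List[Triple], question: str, top_k: int = 3) -> List[Triple]:
    """Order triples by token overlap with the question and keep the top_k."""
    q_tokens = set(question.lower().replace("?", "").split())

    def overlap(triple: Triple) -> int:
        return len(set(" ".join(triple).lower().split()) & q_tokens)

    return sorted(triples, key=overlap, reverse=True)[:top_k]


def build_prompt(triples: List[Triple], question: str) -> str:
    """Number the triples so the model can cite them as [1], [2], ... or write [NA]."""
    lines = [f"[{i}] ({s}, {p}, {o})" for i, (s, p, o) in enumerate(triples, 1)]
    return (
        "Knowledge:\n" + "\n".join(lines) + "\n"
        f"Question: {question}\n"
        "Answer with citations like [1]; write [NA] where the knowledge is missing."
    )


# Hypothetical toy KG, for illustration only.
kg = [
    ("Ada", "born in", "London"),
    ("Ada", "occupation", "mathematician"),
    ("Bob", "born in", "Paris"),
]
question = "Where was Ada born?"
prompt = build_prompt(rerank(retrieve(kg, question), question), question)
print(prompt)
```

Inspecting the prompt and the model's cited indices side by side is a quick way to see whether a missing citation comes from retrieval or from generation.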

Agent Features

Tool Use

Uses KG retrieval + LLM generation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/lixinze777/Knowledge-aware-Language-Model-Attribution

Data URLs

https://github.com/lixinze777/Knowledge-aware-Language-Model-Attribution

Risks & Boundaries

Limitations

Only simple triple-based KGs where nodes are entities; other KG formats not studied

Text quality scoring uses text-davinci-003 (G-Eval), which may bias evaluations toward certain model styles

When Not To Use

When your knowledge source is not a triple-based KG (e.g., long documents as KG nodes)

When human-verified ground-truth answers are required for evaluation

Failure Modes

High correctness but low recall: models omit required KG facts

Poor retrieval yields large drops in recall even when correctness stays high

Core Entities

Models

GPT-4 (gpt-4-0314)

ChatGPT (gpt-3.5-turbo-0301)

LLaMA-7B

LLaMA-13B

Alpaca-7B

Vicuna-13B

Metrics

G-Eval (text quality: coherence/consistency/fluency/relevance)

Citation correctness / precision / recall / F1

Alignment via TRUE NLI

Datasets

BioKaLMA

WikiData

Biographical database (Plum et al.)

Benchmarks

KaLMA (this work)
