TruthHypo benchmark and KnowHD detector to measure and filter hallucinated scientific hypotheses

Overview

Decision Snapshot: Ready For Pilot

The methods and data are released and results include human validation; the pipeline is ready for testing but expect token-cost and integration work for full production.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you use LLMs to propose scientific leads, add groundedness checks to reduce wasted lab time; grounded-score filtering raises true-positive rates and speeds expert triage.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

The paper introduces TruthHypo, a biomedical benchmark for testing whether LLMs generate truthful scientific hypotheses, and KnowHD, a detector that checks if hypotheses (and their reasoning steps) are grounded in existing literature or a knowledge graph. Evaluations across Llama-3 and GPT-4 families show models often find links but mislabel relations; larger models do better. KnowHD groundedness scores correlate with higher truthfulness and can be used to pick better hypotheses from multiple candidates. Data and code are released on GitHub.

Problem Statement

LLMs can propose plausible scientific hypotheses, but many are unsupported or hallucinated. Validating hypotheses is slow and costly, so we need automatic, practical ways to measure truthfulness and to filter hallucinated proposals before human or lab follow-up.

Main Contribution

TruthHypo: a biomedical benchmark (Chemical–Gene, Disease–Gene, Gene–Gene) split by publication year to simulate unseen discoveries.

KnowHD: a claim-level groundedness detector that verifies atomic claims using PubMed (BM25) and a PubTator knowledge graph.
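A minimal sketch of how a claim-level groundedness score could be computed. It assumes the score is the fraction of atomic claims that a verifier marks as supported; the summary does not spell out KnowHD's exact scoring rule, and `groundedness`, `toy_verifier`, and `KNOWN_FACTS` are hypothetical names standing in for the real PubMed/PubTator lookups.

```python
from typing import Callable

def groundedness(claims: list[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic claims the verifier marks as supported (assumed scoring rule)."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)

# Hypothetical stand-in for a literature / knowledge-graph lookup.
KNOWN_FACTS = {"chemical x binds gene a", "gene a regulates gene b"}

def toy_verifier(claim: str) -> bool:
    # Exact-match lookup; a real verifier would retrieve evidence and judge support.
    return claim.strip().lower() in KNOWN_FACTS

claims = [
    "Chemical X binds gene A",
    "Gene A regulates gene B",
    "Gene B causes disease Y",  # not in the toy knowledge source
]
score = groundedness(claims, toy_verifier)
```

Here two of the three claims are supported, so the hypothesis would fall below an 80% groundedness cutoff.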

Key Findings

Most tested LLMs struggle to predict exact relation types even when they spot links.

Numbers: Link-level F1 ~75–83%; relation-level accuracy often ~40–66% (Table 2).

Practical Use: Do not trust raw relation labels from LLM outputs; use secondary checks before committing resources.

Evidence Ref: Table 2, Section 4.2
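The gap between link-level and relation-level scores can be illustrated with a small sketch. It assumes link-level evaluation collapses every relation label into "link exists" versus "no link", while relation-level accuracy requires an exact label match; the label names and both helper functions are hypothetical, not taken from the paper's code.

```python
def link_f1(gold: list[str], pred: list[str], no_link: str = "no_relation") -> float:
    """F1 for the binary question 'is there any link?' (assumed definition)."""
    pairs = list(zip(gold, pred))
    tp = sum(g != no_link and p != no_link for g, p in pairs)
    fp = sum(g == no_link and p != no_link for g, p in pairs)
    fn = sum(g != no_link and p == no_link for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relation_accuracy(gold: list[str], pred: list[str]) -> float:
    """Share of predictions whose relation label matches the gold label exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy labels: the first prediction finds the link but gets its direction wrong.
gold = ["positive", "negative", "no_relation", "positive"]
pred = ["negative", "negative", "no_relation", "positive"]

f1 = link_f1(gold, pred)             # every link is detected
acc = relation_accuracy(gold, pred)  # one relation label is wrong
```

In this toy case the model detects every link but mislabels one relation, which is the pattern the finding describes.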

High groundedness scores from KnowHD correlate with higher truthfulness.

Numbers: GPT-4o-mini avg accuracy 60.96% (KG+Lit) → 72.77% for hypotheses with groundedness >80%.

Practical Use: Rank or filter candidate hypotheses by groundedness to raise the chance of true hypotheses by ~10–12 percentage points on evaluated tasks.

Evidence Ref: Section 4.3, Figure 3
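A short sketch of threshold filtering, plus the arithmetic behind the reported gain. The hypothesis records, field names, and helper functions are hypothetical; only the 60.96% and 72.77% figures and the 80% cutoff come from the numbers above.

```python
def filter_by_groundedness(hypotheses: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep hypotheses whose groundedness score is strictly above the threshold."""
    return [h for h in hypotheses if h["groundedness"] > threshold]

def accuracy(hypotheses: list[dict]) -> float:
    """Share of hypotheses marked correct; 0.0 for an empty list."""
    if not hypotheses:
        return 0.0
    return sum(h["correct"] for h in hypotheses) / len(hypotheses)

# Hypothetical toy records, not data from the paper.
candidates = [
    {"id": "h1", "groundedness": 0.95, "correct": True},
    {"id": "h2", "groundedness": 0.85, "correct": True},
    {"id": "h3", "groundedness": 0.60, "correct": False},
    {"id": "h4", "groundedness": 0.40, "correct": True},
]
kept = filter_by_groundedness(candidates)

# Reported GPT-4o-mini figures (KG+Lit): all hypotheses vs. groundedness > 80%.
reported_all = 60.96
reported_filtered = 72.77
delta_pp = round(reported_filtered - reported_all, 2)  # gain in percentage points
```

The difference between the two reported accuracies works out to 11.81 percentage points, matching the delta in the results table.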

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4o avg accuracy 66.95% (best KG+Lit setting reported)	—	—	TruthHypo (combined tasks, Table 2)	Table 2 shows GPT-4o achieves mean accuracies exceeding 60% across settings	Section 4.2, Table 2
Accuracy	Accuracy ↑ from 60.96% to 72.77%	All hypotheses (KG+Lit) for GPT-4o-mini	+11.81 pp	TruthHypo (Chemical & Gene, GPT-4o-mini)	Section 4.3, Figure 3 text example	Section 4.3

What To Try In 7 Days

Run your LLM to generate 3–5 hypothesis candidates per prompt and compute KnowHD groundedness scores.

Filter or rank candidates by groundedness before human review; the paper reports roughly 10–12 percentage points higher accuracy for high-groundedness hypotheses on the evaluated tasks.

Combine a lightweight BM25 retriever over your domain corpus with a small KG to support claim verification.
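For the retrieval step, a lightweight BM25 scorer can be sketched in pure Python. This is a minimal illustration, not the paper's retriever: it uses whitespace tokenization, the common non-negative IDF variant, and default `k1`/`b` values that are assumptions; `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus standing in for a domain literature index.
corpus = [
    "aspirin inhibits cox1 activity",
    "tp53 mutation is linked to cancer",
    "weather report for tuesday",
]
scores = bm25_scores("aspirin cox1", corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
```

The top-scoring passages would then be handed to the claim verifier as candidate evidence; for a real corpus, an indexed search engine is likely a better fit than this in-memory loop.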

Agent Features

Tool Use

retrieval (BM25), knowledge-graph lookup

Optimization Features

Token Efficiency

RAG increases input token counts drastically (see Table 5)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Teddy-XiongGZ/TruthHypo

Data URLs

https://github.com/Teddy-XiongGZ/TruthHypo

PubTator 3.0 (used to build TruthHypo)

Risks & Boundaries

Limitations

TruthHypo focuses on biomedical relation prediction (Chemical–Gene, Disease–Gene, Gene–Gene) and may not generalize to other fields.

Grounding depends on the coverage of PubMed and PubTator; missing literature or KG entries limit verification.

When Not To Use

For domains without a suitable literature corpus or knowledge graph.

When you need novel hypotheses that, by definition, have no prior literature support.

Failure Modes

False negatives: verifier fails to find supporting text even when evidence exists due to retrieval gaps.

Echoing/context parroting: model repeats provided context without real reasoning but still gets high groundedness.

Core Entities

Models

Llama-3.1-8B, Llama-3.1-70B, GPT-4o-mini, GPT-4o

Metrics

Precision, recall, F1, accuracy, KnowHD groundedness score

Datasets

TruthHypo (this paper), PubTator 3.0, PubMed corpus, Qi et al. open-ended hypothesis dataset (used in human study)

Benchmarks

TruthHypo
