TruthHypo benchmark and KnowHD detector to measure and filter hallucinated scientific hypotheses

May 20, 2025 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The methods and data are released and results include human validation; the pipeline is ready for testing but expect token-cost and integration work for full production.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Why It Matters For Business

If you use LLMs to propose scientific leads, add groundedness checks to reduce wasted lab time; grounded-score filtering raises true-positive rates and speeds expert triage.

Summary TLDR

The paper introduces TruthHypo, a biomedical benchmark for testing whether LLMs generate truthful scientific hypotheses, and KnowHD, a detector that checks if hypotheses (and their reasoning steps) are grounded in existing literature or a knowledge graph. Evaluations across Llama-3 and GPT-4 families show models often find links but mislabel relations; larger models do better. KnowHD groundedness scores correlate with higher truthfulness and can be used to pick better hypotheses from multiple candidates. Data and code are released on GitHub.

Problem Statement

LLMs can propose plausible scientific hypotheses, but many are unsupported or hallucinated. Validating hypotheses is slow and costly, so we need automatic, practical ways to measure truthfulness and to filter hallucinated proposals before human or lab follow-up.

Main Contribution

TruthHypo: a biomedical benchmark (Chemical–Gene, Disease–Gene, Gene–Gene) split by publication year to simulate unseen discoveries.

KnowHD: a claim-level groundedness detector that verifies atomic claims using PubMed (BM25) and a PubTator knowledge graph.
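
As a rough illustration of claim-level groundedness, the sketch below assumes a hypothesis has already been split into atomic claims and that the score is simply the fraction of claims backed by at least one source. The `Claim` type, its `supported_by` field, and this scoring rule are assumptions made for illustration; they are not taken from the released KnowHD code.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One atomic claim extracted from a hypothesis (hypothetical structure)."""
    text: str
    # Names of the sources that support the claim, e.g. {"literature", "kg"}.
    supported_by: set = field(default_factory=set)


def groundedness(claims):
    """Fraction of claims backed by at least one source (assumed scoring rule)."""
    if not claims:
        return 0.0
    grounded = sum(1 for c in claims if c.supported_by)
    return grounded / len(claims)


# Toy claims with placeholder entity names; not real benchmark data.
claims = [
    Claim("chemical_a binds gene_b", {"literature"}),
    Claim("gene_b regulates gene_c", {"kg", "literature"}),
    Claim("chemical_a therefore inhibits gene_c"),  # no support found
]
score = groundedness(claims)
print(f"groundedness = {score:.2f}")
```

Two of the three toy claims have support, so the unsupported final inference step pulls the score down even though the earlier steps are grounded.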

Key Findings

Most tested LLMs struggle to predict exact relation types even when they spot links.

Numbers: Link-level F1 ~75–83%; relation-level accuracy often ~40–66% (Table 2).

Practical Use: Do not trust raw relation labels from LLM outputs; use secondary checks before committing resources.

Evidence Ref: Table 2, Section 4.2
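
The gap between the two kinds of score can be seen in a toy example. The sketch below computes a link-level F1 (did the model detect that some relation exists?) and a relation-level accuracy (did it name the exact relation?). The label names and the five predictions are made up for illustration; they are not TruthHypo's label set or data, and the paper's exact metric definitions may differ.

```python
def link_f1(gold, pred, no_link="no_relation"):
    """F1 for detecting that *some* relation exists, ignoring its type."""
    pairs = list(zip(gold, pred))
    tp = sum(g != no_link and p != no_link for g, p in pairs)
    fp = sum(g == no_link and p != no_link for g, p in pairs)
    fn = sum(g != no_link and p == no_link for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relation_accuracy(gold, pred):
    """Share of examples where the exact relation label matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical labels: the model finds every link but flips two directions.
gold = ["positive", "negative", "positive", "no_relation", "negative"]
pred = ["positive", "positive", "negative", "no_relation", "negative"]

print(f"link-level F1     = {link_f1(gold, pred):.2f}")
print(f"relation accuracy = {relation_accuracy(gold, pred):.2f}")
```

Here every link is detected, so link-level F1 is perfect, yet two of the five relation labels are wrong. This is the pattern the finding describes: spotting a connection is easier than naming it correctly.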

High groundedness scores from KnowHD correlate with higher truthfulness.

Numbers: GPT-4o-mini avg accuracy 60.96% (KG+Lit) → 72.77% for hypotheses with groundedness >80%.

Practical Use: Rank or filter candidate hypotheses by groundedness to raise the chance of true hypotheses by ~10–12 percentage points on evaluated tasks.

Evidence Ref: Section 4.3, Figure 3
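
A minimal sketch of groundedness-based selection, assuming each candidate already carries a score between 0 and 1. The function name, the candidate tuples, and the fallback to the single best candidate are hypothetical design choices; only the 0.8 cutoff mirrors the >80% bucket quoted above.

```python
def select_hypotheses(candidates, threshold=0.8):
    """Keep candidates whose groundedness exceeds the threshold, best first.

    `candidates` is a list of (hypothesis_text, groundedness) pairs.
    If nothing passes, fall back to the single best-scoring candidate
    (an assumed policy, so reviewers always receive something to look at).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = [c for c in ranked if c[1] > threshold]
    return kept or ranked[:1]


# Hypothetical candidates with made-up scores.
candidates = [("H1", 0.55), ("H2", 0.92), ("H3", 0.81), ("H4", 0.80)]
selected = select_hypotheses(candidates)
print(selected)
```

With a strict `>` comparison, a candidate scoring exactly 0.80 is dropped. Whether the boundary should be inclusive is worth deciding explicitly when tuning the cutoff on your own data.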

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GPT-4o avg accuracy 66.95% (best KG+Lit setting reported) | - | - | TruthHypo (combined tasks, Table 2) | Table 2 shows GPT-4o achieves mean accuracies exceeding 60% across settings | Section 4.2, Table 2 |
| Accuracy | Accuracy ↑ from 60.96% to 72.77% | All hypotheses (KG+Lit) for GPT-4o-mini | +11.81 pp | TruthHypo (Chemical & Gene, GPT-4o-mini) | Section 4.3, Figure 3 text example | Section 4.3 |

What To Try In 7 Days

Run your LLM to generate 3–5 hypothesis candidates per prompt and compute KnowHD groundedness scores.

Filter or rank candidates by groundedness before human review; on the tasks evaluated in the paper, this raised accuracy by roughly 10–12 percentage points.

Combine a lightweight BM25 retriever over your domain corpus with a small KG to support claim verification.
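
The retrieval-plus-KG step above can be sketched in plain Python. This is a toy Okapi BM25 scorer over an in-memory corpus plus a set-of-triples knowledge graph, not the paper's PubMed/PubTator pipeline. The documents, entity names (`chemical_a`, `gene_b`, `gene_c`), parameter values, and helper names are all placeholders chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class BM25:
    """Tiny Okapi BM25 scorer over an in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Smoothed IDF variant that stays non-negative for common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Indices of the k highest-scoring documents."""
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Placeholder knowledge graph: a set of (subject, relation, object) triples.
KG = {("chemical_a", "inhibits", "gene_b"), ("gene_b", "regulates", "gene_c")}


def kg_supports(subject, relation, obj):
    """True if the exact triple is present in the toy knowledge graph."""
    return (subject.lower(), relation.lower(), obj.lower()) in KG


docs = [
    "chemical_a inhibits gene_b expression in liver cells",
    "gene_b regulates gene_c during development",
    "unrelated study of protein folding kinetics",
]
index = BM25(docs)
claim = "chemical_a inhibits gene_b"
print("retrieved doc ids:", index.top_k(claim, k=2))
print("KG support:", kg_supports("chemical_a", "inhibits", "gene_b"))
```

Retrieval only surfaces candidate evidence. Deciding whether a retrieved passage actually supports the claim requires a separate verification step, which this sketch leaves out.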

Agent Features

Tool Use
retrieval (BM25), knowledge-graph lookup

Optimization Features

Token Efficiency
RAG increases input token counts drastically (see Table 5)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

https://github.com/Teddy-XiongGZ/TruthHypo
PubTator 3.0 (used to build TruthHypo)

Risks & Boundaries

Limitations

TruthHypo focuses on biomedical relation prediction (Chemical–Gene, Disease–Gene, Gene–Gene) and may not generalize to other fields.

Grounding depends on the coverage of PubMed and PubTator; missing literature or KG entries limit verification.

When Not To Use

For domains without a suitable literature corpus or knowledge graph.

When you need novel hypotheses that, by definition, have no prior literature support.

Failure Modes

False negatives: the verifier fails to find supporting text even when evidence exists, because of retrieval gaps.

Echoing/context parroting: model repeats provided context without real reasoning but still gets high groundedness.

Core Entities

Models

Llama-3.1-8B, Llama-3.1-70B, GPT-4o-mini, GPT-4o

Metrics

precision, recall, F1, accuracy, KnowHD groundedness score

Datasets

TruthHypo (this paper), PubTator 3.0, PubMed corpus, Qi et al. open-ended hypothesis dataset (used in human study)

Benchmarks

TruthHypo