TruthHypo benchmark and KnowHD detector to measure and filter hallucinated scientific hypotheses

May 20, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

0

Authors

Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs to propose scientific leads, add groundedness checks to reduce wasted lab time. Filtering candidates by groundedness score raises the share of truthful hypotheses that reach reviewers and speeds expert triage.

Summary TLDR

The paper introduces TruthHypo, a biomedical benchmark for testing whether LLMs generate truthful scientific hypotheses, and KnowHD, a detector that checks if hypotheses (and their reasoning steps) are grounded in existing literature or a knowledge graph. Evaluations across Llama-3 and GPT-4 families show models often find links but mislabel relations; larger models do better. KnowHD groundedness scores correlate with higher truthfulness and can be used to pick better hypotheses from multiple candidates. Data and code are released on GitHub.

Problem Statement

LLMs can propose plausible scientific hypotheses, but many are unsupported or hallucinated. Validating hypotheses is slow and costly, so we need automatic, practical ways to measure truthfulness and to filter hallucinated proposals before human or lab follow-up.

Main Contribution

TruthHypo: a biomedical benchmark (Chemical–Gene, Disease–Gene, Gene–Gene) split by publication year to simulate unseen discoveries.

KnowHD: a claim-level groundedness detector that verifies atomic claims using PubMed (BM25) and a PubTator knowledge graph.

Comprehensive evaluation across Llama-3 and GPT-4 model variants showing groundedness scores help select more truthful hypotheses.
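A claim-level groundedness score can be pictured as the fraction of atomic claims for which a verifier finds support. The sketch below is a toy illustration of that idea, not the paper's implementation: the `Claim` type, the `is_supported` rule, and the two in-memory knowledge sources are hypothetical stand-ins for PubMed retrieval and the PubTator knowledge graph, and the "fraction of supported claims" scoring is an assumption.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge sources; a real setup would query PubMed and PubTator.
KG_TRIPLES = {("aspirin", "inhibits", "PTGS2")}
LITERATURE = ["Aspirin irreversibly inhibits PTGS2 in platelets."]


@dataclass(frozen=True)
class Claim:
    subject: str
    relation: str
    obj: str


def is_supported(claim: Claim) -> bool:
    """Toy verifier: exact KG triple match, or all three terms in one passage."""
    if (claim.subject, claim.relation, claim.obj) in KG_TRIPLES:
        return True
    terms = (claim.subject.lower(), claim.relation.lower(), claim.obj.lower())
    return any(all(t in passage.lower() for t in terms) for passage in LITERATURE)


def groundedness(claims: list[Claim]) -> float:
    """Fraction of atomic claims with support; 0.0 for an empty list (assumed)."""
    if not claims:
        return 0.0
    return sum(is_supported(c) for c in claims) / len(claims)


claims = [
    Claim("aspirin", "inhibits", "PTGS2"),   # found in the toy KG
    Claim("aspirin", "activates", "TP53"),   # no support in either source
]
score = groundedness(claims)
print(f"groundedness = {score:.2f}")
```

In practice the verifier would more likely be an LLM or entailment check over retrieved passages; the string matching here only keeps the example self-contained.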

Key Findings

Most tested LLMs struggle to predict exact relation types even when they spot links.

Numbers: Link-level F1 ~75–83%; relation-level accuracy often ~40–66% (Table 2).

High groundedness scores from KnowHD correlate with higher truthfulness.

Numbers: GPT-4o-mini avg accuracy 60.96% (KG+Lit) → 72.77% for hypotheses with groundedness >80%.

Groundedness-based selection improves hypothesis accuracy over simple baselines when external knowledge is used.

Numbers: GPT-4o-mini accuracy 63.44% with groundedness selection (param+KG+Lit) vs lower for greedy/majority baselines (Fig. 4).
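The difference between groundedness-based selection and a majority-vote baseline can be sketched in a few lines. The candidate labels and scores below are made up for illustration, and the function names are hypothetical; the paper's exact selection procedure may differ.

```python
from collections import Counter

# Hypothetical candidates for one prompt: (predicted relation, groundedness in [0, 1]).
candidates = [
    ("positive", 0.40),
    ("positive", 0.55),
    ("negative", 0.90),
]


def majority_vote(cands):
    """Baseline: most frequent predicted label, ignoring groundedness."""
    return Counter(label for label, _ in cands).most_common(1)[0][0]


def select_by_groundedness(cands):
    """Return the label of the candidate with the highest groundedness score."""
    return max(cands, key=lambda c: c[1])[0]


def filter_by_threshold(cands, threshold=0.8):
    """Keep only candidates whose groundedness exceeds the threshold."""
    return [c for c in cands if c[1] > threshold]


print(majority_vote(candidates))
print(select_by_groundedness(candidates))
print(filter_by_threshold(candidates))
```

Here the two strategies disagree: the majority label has weak support, while the single best-grounded candidate predicts a different relation.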

Human expert judgments match KnowHD’s groundedness signal in open-ended tasks.

Numbers: High-grounded group selection ratio: humans 59.26% vs low 40.74%; GPT-4o: 61.11% vs 38.89%; p-values ≪ 0.05.

Combining literature and knowledge-graph contexts yields higher groundedness than using either alone.

Numbers: KG+Lit produced the highest groundedness percentages across tasks (Table 3 examples: combined scores ~76–86%).

Results

Accuracy

Value: GPT-4o avg accuracy 66.95% (best KG+Lit setting reported)

Accuracy

Value: Accuracy rises from 60.96% to 72.77% when only hypotheses with groundedness >80% are kept

Baseline: All hypotheses (KG+Lit) for GPT-4o-mini

Human selection preference (open-ended study)

Value: High-grounded selected 59.26% vs low-grounded 40.74%

Baseline: Low-grounded group

Who Should Care

What To Try In 7 Days

Run your LLM to generate 3–5 hypothesis candidates per prompt and compute KnowHD groundedness scores.

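Filter or rank candidates by groundedness before human review. As a reference point, the paper reports GPT-4o-mini accuracy rising from 60.96% to 72.77% (about 12 percentage points) for hypotheses with groundedness >80%; gains on your own data may differ.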

Combine a lightweight BM25 retriever over your domain corpus with a small KG to support claim verification.
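For the retrieval step, a lightweight BM25 ranker over a small corpus is enough to prototype with. The sketch below is a minimal, dependency-free implementation using one common BM25 variant (Lucene-style IDF, with `k1=1.5` and `b=0.75` as assumed defaults); the corpus, the whitespace tokenizer, and the class name are illustrative and not taken from the paper's code.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Lucene-style IDF, always positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query; 0.0 if no query term matches."""
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Indices of the k highest-scoring documents."""
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


corpus = [
    "aspirin inhibits ptgs2 in platelets",
    "tp53 mutations are common in many tumors",
    "metformin activates ampk signaling",
]
retriever = BM25(corpus)
best = retriever.top_k("aspirin ptgs2", k=1)
print(best, corpus[best[0]])
```

Because matching is purely lexical, a synonym query such as "acetylsalicylic acid" scores zero against the aspirin passage, which is the exact-term limitation noted under Risks & Boundaries.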

Agent Features

Tool Use

  • retrieval (BM25)
  • knowledge-graph lookup

Optimization Features

Token Efficiency

  • RAG increases input token counts drastically (see Table 5)

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • TruthHypo focuses on biomedical relation prediction (Chemical–Gene, Disease–Gene, Gene–Gene) and may not generalize to other fields.
  • Grounding depends on the coverage of PubMed and PubTator; missing literature or KG entries limit verification.
  • BM25 retrieval favors exact-term matches; semantic misses can reduce groundedness scores.

When Not To Use

  • For domains without a suitable literature corpus or knowledge graph.
  • When you need novel hypotheses that, by definition, have no prior literature support.
  • If token cost or latency from large retrieval contexts is prohibitive.

Failure Modes

  • False negatives: verifier fails to find supporting text even when evidence exists due to retrieval gaps.
  • Echoing/context parroting: model repeats provided context without real reasoning but still gets high groundedness.
  • Smaller models can be disrupted by added context and perform worse after augmentation.

Core Entities

Models

  • Llama-3.1-8B
  • Llama-3.1-70B
  • GPT-4o-mini
  • GPT-4o

Metrics

  • precision
  • recall
  • F1
  • Accuracy
  • KnowHD groundedness score
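The metrics above answer two different questions: whether a model detects that a link exists at all, and whether it names the right relation type. The sketch below shows one way to compute both, assuming a link is any label other than a "no relation" label; the label names and the exact definition of link-level F1 are assumptions, not confirmed details of the benchmark.

```python
def link_prf(gold, pred, none_label="no_relation"):
    """Link-level precision, recall, F1: a link is any label other than none_label."""
    pairs = list(zip(gold, pred))
    tp = sum(g != none_label and p != none_label for g, p in pairs)
    fp = sum(g == none_label and p != none_label for g, p in pairs)
    fn = sum(g != none_label and p == none_label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def relation_accuracy(gold, pred):
    """Relation-level accuracy: the predicted label must match exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical labels: the first prediction finds the link but mislabels the relation.
gold = ["positive", "negative", "no_relation", "positive"]
pred = ["negative", "negative", "no_relation", "positive"]

print(link_prf(gold, pred))
print(relation_accuracy(gold, pred))
```

In this toy case link-level F1 is perfect while relation-level accuracy is lower, mirroring the gap reported in the Key Findings.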

Datasets

  • TruthHypo (this paper)
  • PubTator 3.0
  • PubMed corpus
  • Qi et al. open-ended hypothesis dataset (used in human study)

Benchmarks

  • TruthHypo