Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
If you use LLMs to propose scientific leads, add groundedness checks to reduce wasted lab time; filtering candidates by groundedness score raises true-positive rates and speeds expert triage.
Summary TLDR
The paper introduces TruthHypo, a biomedical benchmark for testing whether LLMs generate truthful scientific hypotheses, and KnowHD, a detector that checks if hypotheses (and their reasoning steps) are grounded in existing literature or a knowledge graph. Evaluations across Llama-3 and GPT-4 families show models often find links but mislabel relations; larger models do better. KnowHD groundedness scores correlate with higher truthfulness and can be used to pick better hypotheses from multiple candidates. Data and code are released on GitHub.
Problem Statement
LLMs can propose plausible scientific hypotheses, but many are unsupported or hallucinated. Validating hypotheses is slow and costly, so we need automatic, practical ways to measure truthfulness and to filter hallucinated proposals before human or lab follow-up.
Main Contribution
TruthHypo: a biomedical benchmark (Chemical–Gene, Disease–Gene, Gene–Gene) split by publication year to simulate unseen discoveries.
KnowHD: a claim-level groundedness detector that verifies atomic claims using PubMed (BM25) and a PubTator knowledge graph.
Comprehensive evaluation across Llama-3 and GPT-4 model variants showing groundedness scores help select more truthful hypotheses.
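The publication-year split can be illustrated with a minimal sketch. It assumes each relation carries the year of its supporting publication and that a single cutoff year separates "seen" from "unseen" knowledge; the `Relation` fields, the label strings, and the 2023 cutoff are hypothetical placeholders, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    tail: str
    label: str
    year: int  # hypothetical field: year of the supporting publication


def temporal_split(relations, cutoff_year):
    """Split relations into seen (year <= cutoff) and unseen (year > cutoff)."""
    seen = [r for r in relations if r.year <= cutoff_year]
    seen_pairs = {(r.head, r.tail) for r in seen}
    # Drop later relations whose entity pair was already known before the cutoff,
    # so the unseen set only contains pairs the model could not have read about.
    unseen = [
        r for r in relations
        if r.year > cutoff_year and (r.head, r.tail) not in seen_pairs
    ]
    return seen, unseen


# Toy records with placeholder entity names and labels.
relations = [
    Relation("chemical_A", "gene_X", "label_1", 2021),
    Relation("disease_B", "gene_Y", "label_2", 2024),
    Relation("gene_X", "gene_Z", "label_3", 2024),
    Relation("chemical_A", "gene_X", "label_1", 2024),  # pair already seen in 2021
]
seen, unseen = temporal_split(relations, cutoff_year=2023)
```

Filtering out pairs that already appear before the cutoff is a design choice of this sketch; it keeps the held-out set limited to relations that could not have been memorized from earlier literature.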
Key Findings
Most tested LLMs struggle to predict exact relation types even when they spot links.
High groundedness scores from KnowHD correlate with higher truthfulness.
Groundedness-based selection improves hypothesis accuracy over simple baselines when external knowledge is used.
Human expert judgments match KnowHD’s groundedness signal in open-ended tasks.
Combining literature and knowledge-graph contexts yields higher groundedness than using either alone.
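The findings on groundedness can be made concrete with a small sketch. It assumes the groundedness score is the fraction of a hypothesis's atomic claims that at least one context supports; this definition, the function names, and the claim identifiers are illustrative assumptions. In KnowHD the per-claim verification is done against retrieved evidence, whereas here pre-verified sets of claim IDs stand in for that step.

```python
def groundedness(claims, literature_support, kg_support):
    """Fraction of atomic claims supported by literature or the knowledge graph."""
    if not claims:
        return 0.0
    supported = [c for c in claims if c in literature_support or c in kg_support]
    return len(supported) / len(claims)


def select_most_grounded(candidates, literature_support, kg_support):
    """Pick the candidate hypothesis with the highest groundedness score."""
    return max(
        candidates,
        key=lambda cand: groundedness(cand["claims"], literature_support, kg_support),
    )


# Hypothetical verification results: claim IDs each source was able to support.
literature_support = {"c1"}
kg_support = {"c2"}

candidates = [
    {"hypothesis": "H1", "claims": ["c1", "c4"]},
    {"hypothesis": "H2", "claims": ["c1", "c2", "c3"]},
]

lit_only = groundedness(candidates[1]["claims"], literature_support, set())
kg_only = groundedness(candidates[1]["claims"], set(), kg_support)
combined = groundedness(candidates[1]["claims"], literature_support, kg_support)
best = select_most_grounded(candidates, literature_support, kg_support)
```

Under this definition a claim counts as grounded if either source supports it, so the combined score can never be lower than the score from a single source, which is consistent with the last finding above.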
Results
Accuracy
Human selection preference (open-ended study)
Who Should Care
What To Try In 7 Days
Run your LLM to generate 3–5 hypothesis candidates per prompt and compute KnowHD groundedness scores.
Filter or rank candidates by groundedness before human review; expect ~10% higher hit rate on validated tasks.
Combine a lightweight BM25 retriever over your domain corpus with a small KG to support claim verification.
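For the retrieval step, a lightweight BM25 scorer fits in a few lines of standard-library Python. The sketch below implements the Okapi BM25 formula with a non-negative IDF variant; the class name, the whitespace tokenizer, the `k1` and `b` defaults, and the toy corpus are assumptions for illustration and are not the retriever configuration used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {
            t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], len(self.docs[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query, k=2):
        """Indices of the k highest-scoring documents."""
        ranked = sorted(
            range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True
        )
        return ranked[:k]


# Toy corpus; in practice this would be your domain abstracts.
corpus = [
    "aspirin inhibits cox1 activity",
    "brca1 mutations increase breast cancer risk",
    "cox1 expression in gastric tissue",
]
bm25 = BM25(corpus)
hits = bm25.top_k("aspirin cox1", k=2)
```

Because scoring relies on exact token overlap, a query that uses a synonym absent from the corpus (for example "acetylsalicylic acid" here) scores zero on every document, so the retrieved passages are best treated as candidate evidence for the claim verifier rather than as proof of absence.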
Agent Features
Tool Use
- retrieval (BM25)
- knowledge-graph lookup
Optimization Features
Token Efficiency
- RAG increases input token counts drastically (see Table 5)
Reproducibility
Data Urls
- https://github.com/Teddy-XiongGZ/TruthHypo
- PubTator 3.0 (used to build TruthHypo)
Code Available
- yes
Data Available
- yes
Open Source Status
- yes
Risks & Boundaries
Limitations
- TruthHypo focuses on biomedical relation prediction (Chemical–Gene, Disease–Gene, Gene–Gene) and may not generalize to other fields.
- Grounding depends on the coverage of PubMed and PubTator; missing literature or KG entries limit verification.
- BM25 retrieval favors exact-term matches; semantic misses can reduce groundedness scores.
When Not To Use
- For domains without a suitable literature corpus or knowledge graph.
- When you need novel hypotheses that, by definition, have no prior literature support.
- If token cost or latency from large retrieval contexts is prohibitive.
Failure Modes
- False negatives: retrieval gaps cause the verifier to miss supporting text even when the evidence exists.
- Echoing/context parroting: model repeats provided context without real reasoning but still gets high groundedness.
- Smaller models can be disrupted by added context and perform worse after augmentation.
Core Entities
Models
- Llama-3.1-8B
- Llama-3.1-70B
- GPT-4o-mini
- GPT-4o
Metrics
- precision
- recall
- F1
- Accuracy
- KnowHD groundedness score
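The first four metrics can be sketched as follows. This example assumes precision, recall, and F1 are computed at the link level (did the model predict any relation where one exists?) and accuracy at the relation level (did it predict the exact relation type?). That division, the `no_relation` label, and the other label strings are assumptions made for illustration rather than definitions quoted from the paper.

```python
NO_LINK = "no_relation"  # hypothetical label meaning "no link between the entities"


def link_precision_recall_f1(gold, pred):
    """Precision, recall and F1 for detecting that a link exists at all."""
    tp = sum(g != NO_LINK and p != NO_LINK for g, p in zip(gold, pred))
    fp = sum(g == NO_LINK and p != NO_LINK for g, p in zip(gold, pred))
    fn = sum(g != NO_LINK and p == NO_LINK for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def relation_accuracy(gold, pred):
    """Share of examples where the exact relation type matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy labels: the model finds most links but often names the wrong relation.
gold = ["stimulate", "inhibit", NO_LINK, "stimulate"]
pred = ["inhibit", "inhibit", "stimulate", NO_LINK]

precision, recall, f1 = link_precision_recall_f1(gold, pred)
acc = relation_accuracy(gold, pred)
```

In this toy case the link-level scores are noticeably higher than the relation-level accuracy, the same pattern the summary describes: models spot links more reliably than they label the exact relation type.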
Datasets
- TruthHypo (this paper)
- PubTator 3.0
- PubMed corpus
- Qi et al. open-ended hypothesis dataset (used in human study)
Benchmarks
- TruthHypo

