GeneAgent: an LLM agent that queries biology databases to verify and improve gene‑set function explanations

May 25, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Zhizheng Wang, Qiao Jin, Chih-Hsuan Wei, Shubo Tian, Po-Ting Lai, Qingqing Zhu, Chi-Ping Day, Christina Ross, Zhiyong Lu

Why It Matters For Business

GeneAgent reduces false functional claims by checking LLM outputs against curated biology databases, cutting manual validation time and producing more trustworthy gene‑set summaries for research pipelines.

Summary TLDR

GeneAgent is a language agent built on GPT-4 that automatically calls curated biological databases to verify and edit its own gene-set analyses. On 1,106 gene sets from GO, NeST, and MSigDB, GeneAgent produces names and narratives closer to the reference terms and reduces hallucinated functions. The system verifies individual claims against 18 domain databases via four APIs and flags or fixes unsupported claims. Manual checks and a case study on seven mouse melanoma gene sets show better relevance and new insights compared to vanilla GPT-4.

Problem Statement

Standard LLMs can suggest plausible but unsupported biological functions (hallucinations) for gene sets. Researchers need automated tools that combine LLM reasoning with factual checks against curated biomedical databases to produce reliable, interpretable gene‑set annotations.

Main Contribution

Design of GeneAgent: a cascaded pipeline (generate → self-verify → modify → summarize) that uses GPT-4 plus autonomous API calls to domain databases to verify claims about gene sets.

Integration with four Web APIs (g:Profiler, Enrichr, NCBI E-utils, and a custom gene-centric API) that give access to 18 curated biomedical databases for evidence-based verification.

Comprehensive evaluation on 1,106 gene sets (GO, NeST, MSigDB) and a real-world case with seven mouse melanoma gene sets showing quantitative gains over standard GPT-4 and improved mitigation of hallucinations.
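The cascaded pipeline described above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Claim` and `Analysis` types, the four stage functions, and the toy evidence set are all hypothetical stand-ins, and the LLM and database calls are replaced by plain Python stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Claim:
    text: str
    decision: str = "unknown"  # "supported", "refuted", or "unknown"


@dataclass
class Analysis:
    name: str
    claims: List[Claim] = field(default_factory=list)


def run_pipeline(
    genes: List[str],
    generate: Callable[[List[str]], Analysis],
    verify: Callable[[Claim, List[str]], str],
    modify: Callable[[Analysis], Analysis],
    summarize: Callable[[Analysis], str],
) -> str:
    """Run the four stages in order: generate, self-verify, modify, summarize."""
    draft = generate(genes)
    for claim in draft.claims:
        claim.decision = verify(claim, genes)
    revised = modify(draft)
    return summarize(revised)


# Toy stand-ins (hypothetical): a real system would call an LLM and databases.
TOY_EVIDENCE = {"BRCA1 is involved in DNA repair"}


def toy_generate(genes: List[str]) -> Analysis:
    return Analysis(
        name="DNA repair",
        claims=[
            Claim("BRCA1 is involved in DNA repair"),
            Claim("TP53 encodes a ribosomal protein"),  # deliberately wrong claim
        ],
    )


def toy_verify(claim: Claim, genes: List[str]) -> str:
    return "supported" if claim.text in TOY_EVIDENCE else "refuted"


def toy_modify(analysis: Analysis) -> Analysis:
    # Keep only the claims that survived verification.
    kept = [c for c in analysis.claims if c.decision == "supported"]
    return Analysis(name=analysis.name, claims=kept)


def toy_summarize(analysis: Analysis) -> str:
    return f"{analysis.name}: " + "; ".join(c.text for c in analysis.claims)


result = run_pipeline(
    ["BRCA1", "TP53"], toy_generate, toy_verify, toy_modify, toy_summarize
)
print(result)
```

Passing the stages in as functions keeps the control flow separate from the model and database details, so each stage can be swapped or tested on its own.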

Key Findings

GeneAgent increases n-gram and longest-common-subsequence (LCS) overlap between generated names and reference terms, compared with GPT-4, on the evaluated datasets.

Numbers: ROUGE-1/ROUGE-L from 23.9% → 31.0% (MSigDB); ROUGE-2 from 7.4% → 15.5%

GeneAgent yields higher semantic similarity to reference terms than GPT-4.

Numbers: Average similarity (MedCPT) = 0.705, 0.761, 0.736 across the three datasets

More GeneAgent names rank highly versus a large background of terms.

Numbers: 76.9% (850/1,106) exceed the 90th percentile vs GPT-4 74.5% (742/1,106)

Self-verification covers nearly all claims, and unsupported claims trigger edits.

Numbers: 15,903 claims; 15,848 (99.6%) verified; 84% supported, 1% partially supported, 8% refuted, 7% unknown; 88.5% of sets with unsupported claims got

Using verification reports as a gene synopsis increases the match to statistical enrichment terms.

Numbers: Exact-match overlap with g:Profiler significant terms: 80.7% (296/367) with the verification report vs 56.0% without a synopsis
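To make the "exceeds the 90th percentile" finding concrete, here is a minimal sketch of one way to rank a similarity score against a background of scores. The function names and the toy background are assumptions for illustration; the paper's exact ranking procedure may differ.

```python
from bisect import bisect_left
from typing import List


def percentile_rank(score: float, background_scores: List[float]) -> float:
    """Percentage of background scores strictly below `score`."""
    ordered = sorted(background_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def share_above(ranks: List[float], threshold: float = 90.0) -> float:
    """Fraction of percentile ranks strictly above `threshold`."""
    return sum(r > threshold for r in ranks) / len(ranks)


# Toy background of similarity scores (hypothetical values).
background = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
rank = percentile_rank(0.92, background)
share = share_above([95.0, 50.0, 91.0, 90.0])
print(rank, share)
```

Whether the threshold comparison is strict or inclusive changes the count at the boundary, so it is worth fixing that choice before comparing systems.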

Results

ROUGE (name overlap)

Value: MSigDB: ROUGE-1/ROUGE-L 23.9% → 31.0%; ROUGE-2 7.4% → 15.5%

Baseline: GPT-4 (Hu et al.) scores

Semantic similarity (MedCPT avg)

Value: GeneAgent: 0.705, 0.761, 0.736 (three datasets)

Baseline: GPT-4 average is lower (statistically significant, p < 0.05)

High-percentile placements

Value: 850/1,106 (76.9%) names >90th percentile vs GPT-4 742/1,106 (74.5%)

Baseline: GPT-4 (Hu et al.)

Self-verification coverage

Value: 15,903 claims; 15,848 (99.6%) verified

Baseline: N/A

Verification decisions breakdown

Value: Supported 84%; Partially supported 1%; Refuted 8%; Unknown 7%

Baseline: N/A

Enrichment-term exact match (using verification report)

Value: 80.7% (296/367) matched g:Profiler significant terms

Baseline: Ontological synopsis 68.8%; no synopsis 56.0%

Human check of verification correctness

Value: Human labels: 92% of decisions correct (122/132 claims)

Baseline: N/A
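A breakdown like the verification-decision percentages above can be computed from a flat list of per-claim labels. The sketch below is illustrative only: the label strings and the toy counts are assumptions, not the paper's data.

```python
from collections import Counter
from typing import Dict, List


def decision_breakdown(decisions: List[str]) -> Dict[str, float]:
    """Return the percentage of claims per decision label, rounded to 0.1."""
    counts = Counter(decisions)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Toy labels (hypothetical): 6 supported, 2 refuted, 2 unknown.
toy_decisions = ["supported"] * 6 + ["refuted"] * 2 + ["unknown"] * 2
breakdown = decision_breakdown(toy_decisions)
print(breakdown)
```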

Who Should Care

What To Try In 7 Days

Run GeneAgent on recent gene lists from your lab to get evidence‑backed process names and narratives.

Integrate verification reports as a synopsis for downstream enrichment or curation steps.

Mask any database that contains your ground truth during evaluation to avoid leakage, as the paper does.
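The masking step can be sketched as a simple filter over the evidence sources the agent is allowed to query. Everything named here is hypothetical: the source identifiers and the `LEAKY_SOURCES` mapping are placeholders for whatever databases overlap with your own benchmark.

```python
from typing import Dict, List, Set

# Hypothetical mapping: benchmark name -> evidence sources that contain
# its reference terms and would therefore leak the answer.
LEAKY_SOURCES: Dict[str, Set[str]] = {
    "GO": {"go_annotation_db"},
    "MSigDB": {"msigdb_db"},
}


def allowed_sources(all_sources: List[str], benchmark: str) -> List[str]:
    """Drop any evidence source that could leak the benchmark's ground truth."""
    blocked = LEAKY_SOURCES.get(benchmark, set())
    return [s for s in all_sources if s not in blocked]


sources = ["go_annotation_db", "msigdb_db", "pathway_db", "interaction_db"]
go_safe = allowed_sources(sources, "GO")
print(go_safe)
```

Applying the filter before the agent runs, rather than discarding evidence afterwards, ensures leaked terms never reach the prompt.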

Agent Features

Memory

  • No long-term memory reported

Planning

  • Cascaded generate → verify → modify → summarize

Tool Use

  • Autonomous Web API calls (g:Profiler, Enrichr, E-utils, CustomAPI)
  • Database-backed evidence retrieval and matching
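One common way to organize this kind of tool use is a registry that maps tool names to callables, so the agent can request evidence by name. The sketch below assumes such a design for illustration; the tool names, canned return values, and function signatures are hypothetical and do not reflect the real g:Profiler, Enrichr, or E-utils interfaces.

```python
from typing import Callable, Dict, List

ToolFn = Callable[[List[str]], List[str]]


def fake_enrichment(genes: List[str]) -> List[str]:
    # Stand-in for an enrichment API call; returns canned text.
    return [f"enriched term for {len(genes)} genes"]


def fake_gene_summary(genes: List[str]) -> List[str]:
    # Stand-in for a gene-centric lookup; one line per gene.
    return [f"summary of {g}" for g in genes]


TOOLS: Dict[str, ToolFn] = {
    "enrichment": fake_enrichment,
    "gene_summary": fake_gene_summary,
}


def call_tool(name: str, genes: List[str]) -> List[str]:
    """Dispatch to a registered tool, failing loudly on unknown names."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](genes)


def gather_evidence(genes: List[str], tool_names: List[str]) -> Dict[str, List[str]]:
    """Collect evidence from each requested tool, keyed by tool name."""
    return {name: call_tool(name, genes) for name in tool_names}


evidence = gather_evidence(["BRCA1", "TP53"], ["enrichment", "gene_summary"])
print(evidence)
```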

Frameworks

  • Self-verification agent (selfVeri-Agent) that compiles verification reports

Is Agentic

true

Architectures

  • GPT-4 (LLM backbone)
  • MedCPT (biomedical encoder used for scoring)
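Scoring with an encoder typically means embedding two texts and comparing the vectors with cosine similarity. The sketch below shows only the comparison step on toy vectors; it does not load MedCPT, and the vectors are made-up values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated vectors
print(same, orthogonal)
```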

Collaboration

  • Human expert evaluation for case study and verification audit

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Backbone limited to GPT-4; other LLMs not tested.
  • Self-verification can incorrectly refute or endorse names when relevant databases are missing.
  • Does not pre-process gene sets (no filtering of incoherent genes).
  • Performance depends on the coverage and quality of the consulted domain databases.

When Not To Use

  • When you require only raw GSEA p-values without narrative explanations.
  • When curated databases used by GeneAgent lack coverage for the species or genes of interest.

Failure Modes

  • Incorrectly refuting an accurate process name due to incomplete database evidence.
  • Incorrectly supporting a poor or hallucinated claim when matching signals are weak or noisy.
  • Residual hallucinations when databases lack explicit annotations for genes.

Core Entities

Models

  • GPT-4 (Azure, temp=0)
  • MedCPT (biomedical text encoder)

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • Semantic similarity (MedCPT embedding cosine)
  • Proportion exact-match enrichment terms
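For readers unfamiliar with the ROUGE metrics listed above, here is a minimal from-scratch sketch of ROUGE-N and ROUGE-L as F1 scores over whitespace tokens. It is an assumption that F1 (rather than recall alone) and simple lowercased whitespace tokenization match the paper's setup; published numbers usually come from a dedicated ROUGE package.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


r1 = rouge_n("DNA damage repair", "DNA repair", 1)
r2 = rouge_n("DNA damage repair", "DNA repair", 2)
rl = rouge_l("DNA damage repair", "DNA repair")
print(r1, r2, rl)
```

In this toy pair the two names share both unigrams of the reference but no bigram, which is why ROUGE-2 is far stricter than ROUGE-1 or ROUGE-L on short process names.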

Datasets

  • Gene Ontology (GO) gene sets
  • NeST gene sets
  • MSigDB gene sets
  • Mouse B2905 melanoma gene sets (case study)

Benchmarks

  • Reference process terms from GO/NeST/MSigDB
  • g:Profiler enrichment (used as statistical ground truth)