Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
GeneAgent reduces false functional claims by checking LLM outputs against curated biology databases, cutting manual validation time and producing more trustworthy gene‑set summaries for research pipelines.
Summary TLDR
GeneAgent is a language agent built on GPT-4 that automatically calls curated biological databases to verify and edit its own gene-set analyses. On 1,106 gene sets from GO, NeST, and MSigDB, GeneAgent produces names and narratives closer to the reference terms and hallucinates fewer functions than GPT-4. The system verifies individual claims against 18 domain databases via four Web APIs and flags or fixes unsupported claims. Manual checks and a case study of seven mouse melanoma gene sets show better relevance and new insights compared to vanilla GPT-4.
Problem Statement
Standard LLMs can suggest plausible but unsupported biological functions (hallucinations) for gene sets. Researchers need automated tools that combine LLM reasoning with factual checks against curated biomedical databases to produce reliable, interpretable gene‑set annotations.
Main Contribution
Design of GeneAgent: a cascaded pipeline (generate → self-verify → modify → summarize) that uses GPT-4 plus autonomous API calls to domain databases to verify claims about gene sets.
Integration with four Web APIs (g:Profiler, Enrichr, NCBI E-utils, and a custom gene-centric API) that together give access to 18 curated biomedical databases for evidence-based verification.
Comprehensive evaluation on 1,106 gene sets (GO, NeST, MSigDB) and a real-world case with seven mouse melanoma gene sets showing quantitative gains over standard GPT-4 and improved mitigation of hallucinations.
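The cascade described above can be sketched in a few lines. This is a minimal illustration, not GeneAgent's implementation: `run_cascade`, `Verdict`, `toy_generate`, `toy_verify`, and `TOY_DB` are hypothetical names, the "LLM" is a fixed stub, the "database" is a small dictionary, and the modify step is simplified to dropping unsupported claims (the real system may rewrite them instead).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence: str


def run_cascade(
    genes: Sequence[str],
    generate: Callable[[Sequence[str]], List[str]],
    verify: Callable[[str, Sequence[str]], Verdict],
) -> Tuple[List[str], List[Verdict]]:
    """generate -> self-verify -> modify -> summarize, with injected stages."""
    claims = generate(genes)                          # 1. draft claims
    report = [verify(c, genes) for c in claims]       # 2. check each claim
    kept = [v.claim for v in report if v.supported]   # 3. simplified "modify": drop unsupported
    return kept, report                               # 4. final claims plus verification report


# Toy stand-ins (assumptions): a fixed draft and a dict playing the curated database.
TOY_DB = {"DNA repair": {"BRCA1", "RAD51"}}


def toy_generate(genes: Sequence[str]) -> List[str]:
    return ["DNA repair", "Lipid transport"]


def toy_verify(claim: str, genes: Sequence[str]) -> Verdict:
    hits = sorted(TOY_DB.get(claim, set()) & set(genes))
    return Verdict(claim, bool(hits), ", ".join(hits) or "no supporting annotation")


final_claims, report = run_cascade(["BRCA1", "RAD51", "TP53"], toy_generate, toy_verify)
```

Passing the stages in as callables keeps the control flow separate from the model and database clients, so either can be swapped for a real backend without touching the cascade.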
Key Findings
GeneAgent increases n-gram (ROUGE-1, ROUGE-2) and longest-common-subsequence (ROUGE-L) name overlap with reference terms over GPT-4 on the evaluated datasets.
GeneAgent yields higher semantic similarity to reference terms than GPT-4.
More GeneAgent-generated names land in high similarity percentiles when ranked against a large background of candidate terms.
Self-verification succeeds for most claims, and claims found to be unsupported are subsequently edited.
Using the verification report as a gene-set synopsis increases exact matches to statistically enriched terms.
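One plausible way to compute a percentile placement against a background of terms is sketched below. The summary does not spell out the procedure, so the definition is an assumption: the percentile is the share of background terms that are less similar to the generated name than the reference term is. The two-dimensional vectors are toy stand-ins for text embeddings, and `cosine` and `percentile_vs_background` are hypothetical helper names.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def percentile_vs_background(
    generated: Sequence[float],
    reference: Sequence[float],
    background: Sequence[Sequence[float]],
) -> float:
    """Percent of background terms less similar to the generated name than the reference is."""
    target = cosine(generated, reference)
    below = sum(1 for term in background if cosine(generated, term) < target)
    return 100.0 * below / len(background)


# Toy embeddings (assumptions, not real encoder output).
generated_vec = [1.0, 0.0]
reference_vec = [0.9, 0.1]
background_vecs = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]

pct = percentile_vs_background(generated_vec, reference_vec, background_vecs)
```

Here three of the four background terms are less similar to the generated name than the reference is, so `pct` is 75.0.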
Results
ROUGE (name overlap)
Semantic similarity (MedCPT avg)
High-percentile placements
Self-verification coverage
Verification decisions breakdown
Enrichment-term exact match (using verification report)
Human check of verification correctness
Who Should Care
What To Try In 7 Days
Run GeneAgent on recent gene lists from your lab to get evidence‑backed process names and narratives.
Use the verification reports as gene-set synopses in downstream enrichment or curation steps.
Mask any database that contains your ground-truth during evaluation to avoid leakage, as the paper does.
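The masking step in the last item could look like the sketch below. It assumes evidence arrives as records with a `source` field naming the database they came from; that record layout and the `mask_sources` helper are hypothetical, not part of GeneAgent's published interface.

```python
from typing import Dict, Iterable, List


def mask_sources(evidence: List[Dict[str, str]], masked: Iterable[str]) -> List[Dict[str, str]]:
    """Drop evidence records that come from databases holding the ground truth."""
    blocked = {name.lower() for name in masked}
    return [rec for rec in evidence if rec["source"].lower() not in blocked]


# Toy evidence records (assumed layout): evaluating on GO, so GO evidence is masked.
evidence = [
    {"source": "GO", "term": "DNA repair"},
    {"source": "Enrichr", "term": "Homologous recombination"},
    {"source": "go", "term": "Double-strand break repair"},
]
kept = mask_sources(evidence, masked=["GO"])
```

Filtering at the evidence level, rather than inside each API client, makes it easy to change the masked set per benchmark.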
Agent Features
Memory
- No long-term memory reported
Planning
- Cascaded generate → verify → modify → summarize
Tool Use
- Autonomous Web API calls (g:Profiler, Enrichr, E-utils, CustomAPI)
- Database-backed evidence retrieval and matching
Frameworks
- Self-verification agent (selfVeri-Agent) that compiles verification reports
Is Agentic
true
Architectures
- GPT-4 (LLM backbone)
- MedCPT (biomedical encoder used for scoring)
Collaboration
- Human expert evaluation for case study and verification audit
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Backbone limited to GPT-4; other LLMs not tested.
- Self-verification can incorrectly refute or endorse names when relevant databases are missing.
- Does not pre-process gene sets (no filtering of incoherent genes).
- Performance depends on the coverage and quality of the consulted domain databases.
When Not To Use
- When you require only raw GSEA p-values without narrative explanations.
- When curated databases used by GeneAgent lack coverage for the species or genes of interest.
Failure Modes
- Incorrectly refuting an accurate process name due to incomplete database evidence.
- Incorrectly supporting a poor or hallucinated claim when matching signals are weak or noisy.
- Residual hallucinations when databases lack explicit annotations for genes.
Core Entities
Models
- GPT-4 (via Azure, temperature = 0)
- MedCPT (biomedical text encoder)
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Semantic similarity (MedCPT embedding cosine)
- Proportion exact-match enrichment terms
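For readers unfamiliar with the ROUGE metrics listed above, here is a small pure-Python sketch of ROUGE-N and ROUGE-L on whitespace tokens. It reports F1, which is an assumption: the summary does not say whether precision, recall, or F1 was used, and the evaluation may have relied on a standard ROUGE package rather than code like this.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """F1 over clipped n-gram matches between candidate and reference."""
    c = _ngrams(candidate.lower().split(), n)
    r = _ngrams(reference.lower().split(), n)
    return _f1(sum((c & r).values()), sum(c.values()), sum(r.values()))


def rouge_l(candidate: str, reference: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return _f1(dp[-1][-1], len(a), len(b))


# Toy example: a generated name versus a reference term.
r1 = rouge_n("regulation of DNA repair", "DNA repair", 1)
r2 = rouge_n("regulation of DNA repair", "DNA repair", 2)
rl = rouge_l("regulation of DNA repair", "DNA repair")
```

In this example the generated name contains the whole reference term, so recall is 1 for all three scores and the extra words only lower precision.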
Datasets
- Gene Ontology (GO) gene sets
- NeST gene sets
- MSigDB gene sets
- Mouse B2905 melanoma gene sets (case study)
Benchmarks
- Reference process terms from GO/NeST/MSigDB
- g:Profiler enrichment (used as statistical ground truth)

