Overview
The method is straightforward to prototype with a code-oriented LLM and an HTTP execution step. Before production, it needs extra engineering for robust argument formatting, response parsing, and API failure handling.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.
Summary TLDR
GeneGPT teaches a code-oriented LLM (Codex) to call NCBI Web APIs (E-utils and BLAST) during generation. It combines a prompt that mixes API documentation with four short demonstrations and an inference loop that executes a URL whenever the model emits a special marker. With this setup, GeneGPT answers genomics questions far more accurately than vanilla LLMs and retrieval-based systems on the GeneTuring benchmark (macro-average 0.83 vs 0.44 for New Bing). A slim prompt with two demonstrations performs comparably to the full prompt, and the method generalizes to multi-hop questions via chained API calls (new GeneHop dataset). Errors cluster by task and point to practical extraction and API-argument failures.
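The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the marker string `->`, the convention that URLs appear in square brackets, and the `generate` and `fetch` callables are all assumptions introduced here (the summary only says "a special marker").

```python
import re
from typing import Callable

MARKER = "->"  # assumed marker string; the summary only mentions "a special marker"
URL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]")  # assumed: URLs are written in [brackets]


def run_api_loop(
    prompt: str,
    generate: Callable[[str], str],
    fetch: Callable[[str], str],
    max_calls: int = 5,
    max_result_chars: int = 2000,
) -> str:
    """Alternate between model generation and API execution.

    `generate(text)` is a hypothetical wrapper that returns the model's
    continuation of `text`, stopping right after MARKER or at the end of
    the answer. `fetch(url)` is a hypothetical HTTP helper that returns
    the response body as a string.
    """
    text = prompt
    for _ in range(max_calls):
        continuation = generate(text)
        text += continuation
        if not continuation.endswith(MARKER):
            break  # the model finished its answer without requesting a call
        urls = URL_PATTERN.findall(continuation)
        if not urls:
            break  # marker without a URL: stop rather than guess
        # Execute the most recent URL and feed the (truncated) result back.
        result = fetch(urls[-1])[:max_result_chars]
        text += f"[{result}]\n"
    return text[len(prompt):]
```

Passing `generate` and `fetch` in as arguments keeps the loop testable offline: both can be replaced with stubs before any real model or network call is wired in. The `max_calls` and `max_result_chars` caps are illustrative guards against runaway loops and context overflow.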
Problem Statement
Autoregressive LLMs hallucinate when asked to report precise biomedical facts. Domain web APIs contain authoritative data but are hard for non-experts to use. Can we teach an LLM, via prompting and an API-aware decoding loop, to call database web APIs and use their results to answer precise genomics questions?
Main Contribution
GeneGPT: a prompting + decoding method that lets Codex generate API URLs, triggers real API calls, and ingests results to form answers.
State-of-the-art results on 8 GeneTuring genomics tasks (macro-average 0.83), outperforming retrieval-augmented and biomedical LLMs.
Key Findings
GeneGPT outperforms retrieval and domain LLMs on evaluated genomics QA.
A much smaller prompt (two demonstrations) achieves similar or better results than the full prompt.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GeneTuring overall average | 0.83 (GeneGPT) | 0.44 (New Bing) | +0.39 | GeneTuring (9 selected tasks) | Table 2 (main results) | Table 2, §3.4 |
| GeneHop average (multi-hop) | 0.50 (GeneGPT) | 0.24 (New Bing) | +0.26 | GeneHop (3 tasks, 50 questions each) | Table 3 | Table 3, §4.2 |
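As a quick check of the arithmetic in the table, the sketch below recomputes the two deltas and shows how a macro-average is formed (an unweighted mean of per-task scores). The helper names are hypothetical, and the per-task scores in the usage example are placeholders, not values from the paper.

```python
from statistics import mean


def macro_average(task_scores: dict[str, float]) -> float:
    """Unweighted mean over tasks: every task counts equally."""
    return mean(list(task_scores.values()))


def delta(value: float, baseline: float) -> float:
    """Absolute improvement over the baseline, rounded to two decimals."""
    return round(value - baseline, 2)


# Deltas reported in the table above.
print(delta(0.83, 0.44))  # 0.39 (GeneTuring)
print(delta(0.50, 0.24))  # 0.26 (GeneHop)

# Placeholder per-task scores, for illustration only.
print(macro_average({"task_a": 1.0, "task_b": 0.5}))  # 0.75
```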
What To Try In 7 Days
Prototype an API-aware prompt: include 2 short demonstrations (alias and alignment) and API docs.
Use a code-oriented LLM (Codex-like) and implement an execution loop that detects a special marker and issues HTTP calls.
Run the prototype on a small domain benchmark and log error types to prioritize parsing or argument fixes.
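A starting point for the first step might look like the sketch below. It assembles a prompt from API documentation, demonstrations, and the question, and builds an E-utils `esearch` URL of the kind a demonstration could contain. The prompt layout and the function names are assumptions made here for illustration; the paper's exact prompt format is not reproduced.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an E-utils esearch URL with properly encoded arguments."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def build_prompt(api_docs: list[str], demonstrations: list[str], question: str) -> str:
    """Concatenate docs, demonstrations, and the question (assumed layout)."""
    parts = [
        "\n\n".join(api_docs),
        "\n\n".join(demonstrations),
        f"Question: {question}\n",
    ]
    # Skip empty sections so a docs-free or demo-free prompt stays tidy.
    return "\n\n".join(part for part in parts if part)


print(esearch_url("gene", "LMP10"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=LMP10&retmax=5&retmode=json
```

Building URLs with `urlencode` in the demonstrations, rather than by string concatenation, shows the model consistently encoded arguments, which may help with the argument-formatting errors noted under Failure Modes.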
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
Relies on a Codex-style model with code pretraining and a long context window.
Fails when the target information is absent from NCBI databases (unanswerable via the API).
When Not To Use
When the needed knowledge is not covered by the target web API or database.
When you cannot make external HTTP calls for privacy or regulatory reasons.
Failure Modes
E1: wrong API choice or not using API (database selection errors)
E2: correct API but incorrect arguments (formatting/parsing errors)
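When logging errors from a prototype, the two categories above can be tagged automatically with a coarse structural check. The sketch below is one possible approach under stated assumptions: the function name, the `"ok"` label, and the idea of comparing against an expected endpoint and a set of required parameters are all introduced here. It only checks that the right endpoint was called with non-empty required arguments, not that the argument values are correct.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def classify_call(
    url: Optional[str],
    expected_endpoint: str,
    required_params: set[str],
) -> str:
    """Return "E1", "E2", or "ok" for one logged API call (hypothetical labels)."""
    if not url:
        return "E1"  # the model answered without calling any API
    parsed = urlparse(url)
    if not parsed.path.endswith(expected_endpoint):
        return "E1"  # wrong API or endpoint chosen
    # parse_qs drops parameters with blank values by default,
    # so "term=" counts as missing here.
    params = parse_qs(parsed.query)
    if required_params - set(params):
        return "E2"  # right endpoint, but arguments missing or empty
    return "ok"


base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
print(classify_call(None, "esearch.fcgi", {"db", "term"}))                         # E1
print(classify_call(f"{base}/esearch.fcgi?db=gene", "esearch.fcgi", {"db", "term"}))  # E2
```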

