Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.
Summary TLDR
GeneGPT teaches a code-oriented LLM (Codex) to call NCBI Web APIs (E-utils and BLAST) during generation. It combines a prompt that mixes API documentation with four short demonstrations and an inference loop that executes a URL whenever the model emits a special marker. With this setup, GeneGPT answers genomics questions far more accurately than vanilla LLMs and retrieval-based systems on the GeneTuring benchmark (macro-average 0.83 vs 0.44 for New Bing). A slim prompt with two demonstrations performs comparably, and the method generalizes to multi-hop questions via chained API calls (new GeneHop dataset). Errors cluster by task and point to practical failures in answer extraction and API-argument construction.
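The prompt layout described above (API documentation, then demonstrations, then the new question) can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the function name `build_prompt`, the instruction sentence, and the `Documentation N:` / `Question:` labels are all assumptions made for the example.

```python
def build_prompt(api_docs, demonstrations, question):
    """Assemble a few-shot prompt: docs first, then worked demos, then the new question.

    api_docs: list of documentation strings (hypothetical content).
    demonstrations: list of (question, worked_trace) pairs.
    The header sentence and section labels are illustrative assumptions.
    """
    parts = ["Use the Web APIs described below to answer genomics questions."]
    for i, doc in enumerate(api_docs, start=1):
        parts.append(f"Documentation {i}:\n{doc.strip()}")
    for demo_question, demo_trace in demonstrations:
        parts.append(f"Question: {demo_question.strip()}\n{demo_trace.strip()}")
    # Leave the final question open so the model continues from here.
    parts.append(f"Question: {question.strip()}\n")
    return "\n\n".join(parts)
```

Under this layout, a slim variant is just a call with fewer entries in `demonstrations`, which makes it easy to compare prompt sizes.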
Problem Statement
Autoregressive LLMs hallucinate when asked to report precise biomedical facts. Domain web APIs contain authoritative data but are hard for non-experts to use. Can we teach an LLM, via prompting and an API-aware decoding loop, to call database web APIs and use their results to answer precise genomics questions?
Main Contribution
GeneGPT: a prompting + decoding method that lets Codex generate API URLs, triggers real API calls, and ingests results to form answers.
State-of-the-art results on 8 GeneTuring genomics tasks (macro-average 0.83), outperforming retrieval-augmented and biomedical LLMs.
GeneHop: a new multi-hop genomics QA dataset showing GeneGPT can chain API calls to solve multi-step questions.
Key Findings
GeneGPT outperforms retrieval-augmented and biomedical-domain LLMs on the evaluated genomics QA tasks.
A much smaller prompt (two demonstrations) achieves similar or better results than the full prompt.
GeneGPT generalizes to multi-hop questions by composing API calls.
Errors concentrate in predictable categories (wrong API, wrong args, missed extraction, unanswerable).
Results
GeneTuring overall average
GeneHop average (multi-hop)
Sequence alignment (DNA to multiple species)
Who Should Care
What To Try In 7 Days
Prototype an API-aware prompt: include 2 short demonstrations (alias and alignment) and API docs.
Use a code-oriented LLM (Codex-like) and implement an execution loop that detects a special marker and issues HTTP calls.
Run the prototype on a small domain benchmark and log error types to prioritize parsing or argument fixes.
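The execution loop in the second step could look roughly like the sketch below. It rests on several assumptions not stated in this summary: the marker is taken to be `->`, the model is assumed to write the URL in square brackets just before the marker, and `generate` is a hypothetical callable that returns `(completion, hit_marker)` for whatever LLM client is in use. `http_fetch` uses only the standard library.

```python
import re
import urllib.request

CALL_MARKER = "->"  # assumed marker; the summary only says "a special marker"
URL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]$")  # assumed "[URL]" convention


def http_fetch(url, timeout=30):
    """Issue a plain HTTP GET and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def answer_with_api_calls(prompt, generate, fetch=http_fetch,
                          max_calls=5, max_result_chars=2000):
    """Alternate between generation and API execution until the model finishes.

    generate(text, stop) is hypothetical: it must return (completion, hit_marker),
    where completion excludes the stop marker itself.
    """
    text = prompt
    for _ in range(max_calls):
        completion, hit_marker = generate(text, CALL_MARKER)
        text += completion
        if not hit_marker:
            break  # the model finished without requesting another call
        match = URL_PATTERN.search(completion.rstrip())
        if match is None:
            break  # marker without a well-formed [URL]; stop rather than guess
        # Truncate long responses so the context window is not exhausted.
        result = fetch(match.group(1))[:max_result_chars]
        text += f"{CALL_MARKER}[{result}]\n"
    return text[len(prompt):]
```

Passing `fetch` as a parameter keeps the loop testable offline and gives one place to add rate limiting or logging of each call.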
Agent Features
Memory
- no persistent external memory
Planning
- chain-of-thought subquestion decomposition
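A decomposed question typically turns into a chain in which the output of one API call becomes an argument of the next. The sketch below hard-codes one such chain against NCBI E-utils (`esearch` to resolve a term to an ID, then `esummary` to fetch that record). It is an illustration of the chaining idea only: in GeneGPT the model itself writes each URL, and the function name `two_hop_lookup` and the injected `fetch` callable are assumptions.

```python
import json
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def two_hop_lookup(term, fetch, db="gene"):
    """Hop 1: search for a term. Hop 2: summarize the first hit.

    fetch(url) -> str is supplied by the caller (hypothetical helper).
    Returns the parsed esummary JSON, or None when the search finds nothing.
    """
    search_url = f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": 1, "retmode": "json"}
    )
    ids = json.loads(fetch(search_url))["esearchresult"]["idlist"]
    if not ids:
        return None  # nothing to chain on; corresponds to an unanswerable case
    summary_url = f"{EUTILS}/esummary.fcgi?" + urlencode(
        {"db": db, "id": ids[0], "retmode": "json"}
    )
    return json.loads(fetch(summary_url))
```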
Tool Use
- NCBI E-utils
- BLAST URL API
- runtime URL execution
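Of the tools listed, the BLAST URL API is asynchronous: a query is submitted with `CMD=Put`, which yields a request ID (RID), and results are retrieved later with `CMD=Get`. The sketch below only builds the two URLs. The helper names are hypothetical, the default `program` and `database` values are assumptions chosen for a DNA query, and the input check is an extra safeguard, not something the paper describes.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_put_url(sequence, program="blastn", database="nt"):
    """Build the submission URL for a DNA sequence (whitespace is stripped)."""
    seq = "".join(sequence.split()).upper()
    if not seq or set(seq) - set("ACGTN"):
        raise ValueError("expected a non-empty DNA sequence of A/C/G/T/N")
    return BLAST_URL + "?" + urlencode(
        {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": seq}
    )


def blast_get_url(rid, format_type="Text"):
    """Build the retrieval URL for a previously submitted request ID."""
    return BLAST_URL + "?" + urlencode(
        {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    )
```

Validating the sequence before building the URL catches one class of malformed-argument failure early, before any network call is made.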
Frameworks
- Codex (code-davinci-002) prompting
Is Agentic
true
Architectures
- LLM + Web API tool use
Collaboration
- single-agent tool calls
Optimization Features
Token Efficiency
- slim prompt with two demonstrations reduces prompt size
Inference Optimization
- stop-and-call decoding loop to execute API calls
Reproducibility
Data Urls
- GeneTuring is public (referenced); GeneHop is introduced in the paper's appendix
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on a Codex-style model with code pretraining and a long context window.
- Fails when target information is absent in NCBI databases (unanswerable with API).
- Sensitive to argument formats and URL construction; wrong args cause errors.
- The paper's evaluation uses exact-match scoring, which is strict and dataset-limited.
When Not To Use
- When the needed knowledge is not covered by the target web API or database.
- When you cannot make external HTTP calls for privacy or regulatory reasons.
- If you lack a code-capable LLM or sufficient context window for demonstrations.
Failure Modes
- E1: wrong API choice or not using API (database selection errors)
- E2: correct API but incorrect arguments (formatting/parsing errors)
- E3: API result contains answer but model fails to extract it
- E4: API returns no relevant entry (unanswerable via API)
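When reviewing failed predictions by hand, the four categories above can be recorded as an enum and tallied per task to see which fix would pay off first. This is a bookkeeping sketch only: the labels come from the list above, while the function names and the task names in the usage note are hypothetical.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    E1 = "wrong API choice or API not used"
    E2 = "correct API, incorrect arguments"
    E3 = "answer present in API result but not extracted"
    E4 = "API returned no relevant entry"


def summarize_errors(labeled_failures):
    """Group manually labeled failures into one Counter per task.

    labeled_failures: iterable of (task_name, ErrorType) pairs.
    """
    per_task = {}
    for task, err in labeled_failures:
        per_task.setdefault(task, Counter())[err] += 1
    return per_task


def top_error(per_task, task):
    """Return the most frequent error type for one task."""
    return per_task[task].most_common(1)[0][0]
```

For example, `top_error(summarize_errors(rows), "alias")` returns the dominant category for a task called `alias`, given `rows` of hand-labeled failures.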
Core Entities
Models
- Codex (code-davinci-002)
- GPT-3 (text-davinci-003)
- ChatGPT
- New Bing
- BioGPT
- BioMedLM
Metrics
- Accuracy
- Recall
- Macro-average score
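A macro-average is the unweighted mean of per-task scores, so every task counts equally regardless of how many questions it contains. The sketch below contrasts it with a micro-average over pooled questions. The task names and numbers in the usage note are placeholders, not results from the paper.

```python
def macro_average(per_task_scores):
    """Unweighted mean of per-task scores (each task counts once)."""
    if not per_task_scores:
        raise ValueError("need at least one task score")
    return sum(per_task_scores.values()) / len(per_task_scores)


def micro_average(per_task_counts):
    """Pooled accuracy: total correct over total questions.

    per_task_counts maps task name -> (num_correct, num_questions).
    """
    correct = sum(c for c, _ in per_task_counts.values())
    total = sum(n for _, n in per_task_counts.values())
    if total == 0:
        raise ValueError("need at least one question")
    return correct / total
```

With placeholder inputs, `macro_average({"task_a": 1.0, "task_b": 0.5})` gives 0.75, while `micro_average({"task_a": (10, 10), "task_b": (45, 90)})` gives 0.55 because the larger task dominates the pooled count.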
Datasets
- GeneTuring
- GeneHop
Benchmarks
- GeneTuring

