Teach an LLM to call NCBI Web APIs and cut hallucinations on genomics QA

April 19, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is straightforward to prototype with a code-oriented LLM and HTTP execution. Before production it needs extra engineering to handle argument formatting, response parsing, and API failures robustly.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qiao Jin, Yifan Yang, Qingyu Chen, Zhiyong Lu

Links

Abstract / PDF / Data

Why It Matters For Business

Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.

Summary TLDR

GeneGPT teaches a code-oriented LLM (Codex) to call NCBI Web APIs (E-utils and BLAST) during generation. The prompt combines API documentation with four short demonstrations, and an inference loop executes a URL whenever the model emits a special marker. On the GeneTuring benchmark, GeneGPT answers genomics questions far more accurately than vanilla LLMs and retrieval-based systems (macro-average 0.83 vs 0.44 for New Bing). A slim prompt with only two demonstrations scores as well as or better than the full prompt (0.83 vs 0.80), and the method generalizes to multi-hop questions via chained API calls (the new GeneHop dataset). Errors cluster by task and point to practical failures in answer extraction and API arguments.
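The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the marker string `->`, the bracketed-URL convention, and the `generate` and `fetch` callables are all assumptions introduced here, since the summary only says that the model emits "a special marker".

```python
from typing import Callable, Optional

CALL_MARKER = "->"  # assumed marker; the summary does not name the actual one


def extract_url(text: str) -> Optional[str]:
    """Return the last bracketed URL in text (e.g. '[https://...]'), or None."""
    start = text.rfind("[http")
    if start == -1:
        return None
    end = text.find("]", start)
    if end == -1:
        return None
    return text[start + 1:end]


def run_with_api_calls(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: returns text up to and including the marker
    fetch: Callable[[str], str],     # hypothetical: performs the HTTP GET and returns the body
    max_calls: int = 5,
) -> str:
    """Alternate between generation and API execution until the model stops calling."""
    text = prompt
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        if not chunk.endswith(CALL_MARKER):
            break  # the model finished its answer without requesting a call
        url = extract_url(chunk)
        if url is None:
            break  # marker without a parsable URL; stop rather than guess
        # Append the API result so the next generation step can read it.
        text += "[" + fetch(url) + "]\n"
    return text[len(prompt):]
```

Passing `generate` and `fetch` in as callables keeps the loop testable offline; in a prototype, `fetch` could wrap an HTTP client and `generate` could wrap a model call configured to stop at the marker.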

Problem Statement

Autoregressive LLMs hallucinate when asked to report precise biomedical facts. Domain web APIs contain authoritative data but are hard for non-experts to use. Can we teach an LLM, via prompting and an API-aware decoding loop, to call database web APIs and use their results to answer precise genomics questions?

Main Contribution

GeneGPT: a prompting + decoding method that lets Codex generate API URLs, triggers real API calls, and ingests results to form answers.

State-of-the-art results on 8 GeneTuring genomics tasks (macro-average 0.83), outperforming retrieval-augmented and biomedical LLMs.

Key Findings

GeneGPT outperforms retrieval and domain LLMs on evaluated genomics QA.

Numbers: Macro-average 0.83 (GeneGPT) vs 0.44 (New Bing) on GeneTuring

Practical Use: Use API-augmented LLMs for database-like biomedical queries instead of plain retrieval-augmented models when authoritative entries matter.

Evidence Ref: Table 2, §3.4

A much smaller prompt (two demonstrations) achieves similar or better results than the full prompt.

Numbers: GeneGPT-slim overall 0.83 vs GeneGPT-full 0.80

Practical Use: You can get strong gains with only two concise, task-relevant API call examples, so keep prompts small and focused.

Evidence Ref: Table 2, §4.1
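One way to picture a slim prompt is as API documentation followed by a couple of worked demonstrations and then the new question. The sketch below assumes that layout; the function name, the "Documentation:" and "Question:" labels, and the placeholder strings are illustrative and are not taken from the paper.

```python
from typing import List, Tuple


def build_prompt(api_docs: List[str], demos: List[Tuple[str, str]], question: str) -> str:
    """Assemble documentation, worked demonstrations, and the new question into one prompt."""
    parts = ["Documentation:"]
    parts += api_docs
    for demo_question, worked_example in demos:
        # Each demonstration pairs a question with its API calls and final answer.
        parts.append(f"Question: {demo_question}\n{worked_example}")
    # Leave the final question open so the model continues from here.
    parts.append(f"Question: {question}\n")
    return "\n\n".join(parts)


# Placeholder content only; real documentation and demonstrations would go here.
slim_prompt = build_prompt(
    api_docs=["<E-utils documentation text>", "<BLAST documentation text>"],
    demos=[
        ("<alias question>", "<worked API calls and answer>"),
        ("<alignment question>", "<worked API calls and answer>"),
    ],
    question="<new genomics question>",
)
```

Keeping the demonstrations as data makes it easy to compare a two-demonstration prompt against a longer one on the same questions.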

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GeneTuring overall average | 0.83 (GeneGPT) | 0.44 (New Bing) | +0.39 | GeneTuring (selected 9 tasks) | Table 2 (main results) | Table 2, §3.4
GeneHop average (multi-hop) | 0.50 (GeneGPT) | 0.24 (New Bing) | +0.26 | GeneHop (3 tasks, 50 q each) | Table 3, §4.2 | Table 3
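The deltas reported here are simple differences between the GeneGPT score and the New Bing baseline, and a macro-average is the unweighted mean of per-task scores. A small sketch of that arithmetic, using only the averages quoted in this summary (the function names are illustrative):

```python
from typing import Dict, List

# Averages as reported in this summary.
results: Dict[str, Dict[str, float]] = {
    "GeneTuring": {"GeneGPT": 0.83, "New Bing": 0.44},
    "GeneHop": {"GeneGPT": 0.50, "New Bing": 0.24},
}


def delta(value: float, baseline: float) -> float:
    """Absolute improvement over the baseline, rounded to two decimals."""
    return round(value - baseline, 2)


def macro_average(task_scores: List[float]) -> float:
    """Unweighted mean of per-task scores: every task counts equally."""
    return sum(task_scores) / len(task_scores)


for name, scores in results.items():
    print(f"{name}: +{delta(scores['GeneGPT'], scores['New Bing']):.2f}")
# GeneTuring: +0.39
# GeneHop: +0.26
```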

What To Try In 7 Days

Prototype an API-aware prompt: include 2 short demonstrations (alias and alignment) and API docs.

Use a code-oriented LLM (Codex-like) and implement an execution loop that detects a special marker and issues HTTP calls.

Run the prototype on a small domain benchmark and log error types to prioritize parsing or argument fixes.
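For the error-logging step, a per-task tally of error labels is often enough to see where to invest first. The sketch below assumes each evaluated question is logged as a dict with a `task` name and an `error` label (`None` when the answer was correct); the record shape and the task names are hypothetical, and the `E1`/`E2` labels follow the failure modes listed in this summary.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, TypedDict


class Record(TypedDict):
    task: str
    error: Optional[str]  # e.g. "E1", "E2", or None when the answer was correct


def tally_errors(records: Iterable[Record]) -> Dict[str, Counter]:
    """Count error labels per task, skipping correctly answered questions."""
    by_task: Dict[str, Counter] = {}
    for rec in records:
        if rec["error"] is None:
            continue
        by_task.setdefault(rec["task"], Counter())[rec["error"]] += 1
    return by_task


# Hypothetical log entries for illustration only.
log = [
    {"task": "task_a", "error": "E2"},
    {"task": "task_a", "error": "E2"},
    {"task": "task_a", "error": "E1"},
    {"task": "task_b", "error": None},
]
for task, counts in tally_errors(log).items():
    print(task, counts.most_common())
# task_a [('E2', 2), ('E1', 1)]
```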

Agent Features

Memory
no persistent external memory
Planning
chain-of-thought subquestion decomposition
Tool Use
NCBI E-utils, BLAST URL API, runtime URL execution
Frameworks
Codex (code-davinci-002) prompting
Is Agentic

Yes

Architectures
LLM + Web API tool use
Collaboration
single-agent tool calls
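Chained tool calls of the kind listed above can be illustrated with two E-utils endpoints: `esearch.fcgi` returns matching record IDs, and those IDs feed an `esummary.fcgi` request. The sketch only builds URLs and parses a canned response, so it runs offline. The search term and the IDs are illustrative placeholders, and this is one plausible chain rather than the exact sequence used in the paper.

```python
import json
from typing import List
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an ESearch URL that asks for matching record IDs as JSON."""
    query = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(query)


def ids_from_esearch(body: str) -> List[str]:
    """Extract the ID list from an ESearch JSON response body."""
    return json.loads(body)["esearchresult"]["idlist"]


def esummary_url(db: str, ids: List[str]) -> str:
    """Build an ESummary URL for the IDs returned by a previous ESearch call."""
    query = {"db": db, "id": ",".join(ids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?" + urlencode(query, safe=",")


# Hop 1: search. Hop 2: summarize the hits (canned response with placeholder IDs).
first_hop = esearch_url("gene", "BRCA1")
canned_response = '{"esearchresult": {"idlist": ["111", "222"]}}'
second_hop = esummary_url("gene", ids_from_esearch(canned_response))
```

In a live prototype the canned response would be replaced by the body returned from fetching `first_hop`.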

Optimization Features

Token Efficiency
slim prompt with two demonstrations reduces prompt size
Inference Optimization
stop-and-call decoding loop to execute API calls

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GeneTuring public (referenced), GeneHop introduced in paper appendix

Risks & Boundaries

Limitations

Relies on Codex-style model with code pretraining and long context.

Fails when target information is absent in NCBI databases (unanswerable with API).

When Not To Use

When the needed knowledge is not covered by the target web API or database.

When you cannot make external HTTP calls for privacy or regulatory reasons.

Failure Modes

E1: wrong API choice or not using API (database selection errors)

E2: correct API but incorrect arguments (formatting/parsing errors)
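Both failure modes can be partly caught before a request is sent by checking the generated URL. The sketch below is one possible guard, not part of the paper's method: the allowed-host set and the required-parameter table are assumptions chosen for illustration, and the table ignores alternatives such as history-server parameters.

```python
from typing import Dict, List, Set
from urllib.parse import parse_qs, urlparse

# Assumed allow-list and minimal required parameters per endpoint.
ALLOWED_HOSTS: Set[str] = {"eutils.ncbi.nlm.nih.gov", "blast.ncbi.nlm.nih.gov"}
REQUIRED: Dict[str, Set[str]] = {
    "esearch.fcgi": {"db", "term"},
    "efetch.fcgi": {"db", "id"},
    "esummary.fcgi": {"db", "id"},
}


def check_url(url: str) -> List[str]:
    """Return a list of problems found in a model-generated API URL (empty if none)."""
    problems: List[str] = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("scheme is not https")
    if parsed.hostname not in ALLOWED_HOSTS:
        problems.append(f"unexpected host: {parsed.hostname}")
    endpoint = parsed.path.rsplit("/", 1)[-1]
    # parse_qs drops parameters with blank values, so "term=" counts as missing.
    params = set(parse_qs(parsed.query))
    for name in sorted(REQUIRED.get(endpoint, set()) - params):
        problems.append(f"missing parameter: {name}")
    return problems
```

A non-empty result could be logged under E1 (wrong host or endpoint) or E2 (missing arguments) and fed back to the model instead of issuing the call.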

Core Entities

Models

Codex (code-davinci-002), GPT-3 (text-davinci-003), ChatGPT, New Bing, BioGPT, BioMedLM

Metrics

Accuracy, recall, macro-average score

Datasets

GeneTuring, GeneHop

Benchmarks

GeneTuring