Teach an LLM to call NCBI Web APIs and cut hallucinations on genomics QA

April 19, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Qiao Jin, Yifan Yang, Qingyu Chen, Zhiyong Lu

Links

Abstract / PDF

Why It Matters For Business

Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.

Summary TLDR

GeneGPT teaches a code-oriented LLM (Codex) to call NCBI Web APIs (E-utils and BLAST) during generation. Its prompt combines API documentation with four short demonstrations, and an inference loop executes a URL whenever the model emits a special marker. On the GeneTuring benchmark, GeneGPT answers genomics questions far more accurately than vanilla LLMs and retrieval-based systems (macro-average 0.83 vs 0.44 for New Bing). A slim prompt with only two demonstrations scores 0.83 overall against 0.80 for the full prompt, and the method generalizes to multi-hop questions via chained API calls (new GeneHop dataset). Errors cluster by task and point to practical extraction and API-argument failures.

Problem Statement

Autoregressive LLMs hallucinate when asked to report precise biomedical facts. Domain web APIs contain authoritative data but are hard for non-experts to use. Can we teach an LLM, via prompting and an API-aware decoding loop, to call database web APIs and use their results to answer precise genomics questions?

Main Contribution

GeneGPT: a prompting + decoding method that lets Codex generate API URLs, triggers real API calls, and ingests results to form answers.

State-of-the-art results on 8 GeneTuring genomics tasks (macro-average 0.83), outperforming retrieval-augmented and biomedical LLMs.

GeneHop: a new multi-hop genomics QA dataset showing GeneGPT can chain API calls to solve multi-step questions.
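The prompting + decoding method described in the first contribution can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `->` marker, the bracketed `[url]` convention, and the `generate` and `fetch` callables are assumptions introduced here, since the summary only says that the model emits "a special marker". Because the loop can run several rounds, the same structure also covers chained calls for multi-hop questions.

```python
import re
from typing import Callable

# Assumed marker and URL convention; the source only mentions "a special marker".
CALL_MARKER = "->"
URL_AT_END = re.compile(r"\[(https?://[^\]\s]+)\]$")


def run_with_api_calls(
    prompt: str,
    generate: Callable[[str], str],
    fetch: Callable[[str], str],
    max_calls: int = 5,
) -> str:
    """Alternate between model generation and API execution.

    generate(text) is a hypothetical wrapper around the LLM: it returns the
    continuation of `text`, stopped just before CALL_MARKER or at the end of
    the answer. fetch(url) is a hypothetical HTTP helper returning the
    response body as text.
    """
    text = prompt
    for _ in range(max_calls):
        continuation = generate(text)
        text += continuation
        match = URL_AT_END.search(continuation.rstrip())
        if match is None:
            # No pending API call: the model has finished its answer.
            break
        # Execute the call and append the result so the model can read it.
        result = fetch(match.group(1))
        text += f"{CALL_MARKER}[{result}]\n"
    return text[len(prompt):]


if __name__ == "__main__":
    # Scripted stand-ins for the model and the network, for illustration only.
    scripted = iter(["[https://example.org/api?q=1]", "Answer: done"])
    print(run_with_api_calls("Question: ...\n", lambda _: next(scripted), lambda _: "stub result"))
```

Passing `generate` and `fetch` in as callables keeps the loop testable without a model or network access. The `max_calls` cap is a safety choice made for this sketch, not something stated in the source.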

Key Findings

GeneGPT outperforms retrieval and domain LLMs on evaluated genomics QA.

Numbers: Macro-average 0.83 (GeneGPT) vs 0.44 (New Bing) on GeneTuring

A much smaller prompt (two demonstrations) achieves similar or better results than the full prompt.

Numbers: GeneGPT-slim overall 0.83 vs GeneGPT-full 0.80

GeneGPT generalizes to multi-hop questions by composing API calls.

Numbers: GeneHop average 0.50 (GeneGPT) vs 0.24 (New Bing)

Errors concentrate in predictable categories (wrong API, wrong args, missed extraction, unanswerable).

Numbers: Gene disease association had 15 wrong-API errors (E1) out of 50

Results

GeneTuring overall average

Value: 0.83 (GeneGPT)

Baseline: 0.44 (New Bing)

GeneHop average (multi-hop)

Value: 0.50 (GeneGPT)

Baseline: 0.24 (New Bing)

Sequence alignment (DNA to multiple species)

Value: 0.86 (GeneGPT)

Baseline: 0.00 (New Bing)

Who Should Care

What To Try In 7 Days

Prototype an API-aware prompt: include 2 short demonstrations (alias and alignment) and API docs.

Use a code-oriented LLM (Codex-like) and implement an execution loop that detects a special marker and issues HTTP calls.

Run the prototype on a small domain benchmark and log error types to prioritize parsing or argument fixes.
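For the first step above, prompt assembly might look something like the sketch below. The `Demo` structure, the section labels, and the instruction line are all hypothetical; the summary states only that the prompt mixes API documentation with a few short demonstrations, so the exact layout is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One worked demonstration (hypothetical structure)."""
    question: str
    trace: str   # API calls and their results, as they should appear in the transcript
    answer: str


def build_prompt(api_docs: list[str], demos: list[Demo], question: str) -> str:
    """Concatenate an instruction, API docs, demonstrations, and the new question."""
    parts = ["Use the documented Web APIs to answer genomics questions."]  # assumed wording
    parts += [f"Documentation {i}:\n{doc}" for i, doc in enumerate(api_docs, start=1)]
    for demo in demos:
        parts.append(f"Question: {demo.question}\n{demo.trace}\nAnswer: {demo.answer}")
    # Leave the final question open so the model continues from here.
    parts.append(f"Question: {question}\n")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder content only; real docs and demonstrations would go here.
    slim_demos = [
        Demo("<alias question>", "<API call and result>", "<answer>"),
        Demo("<alignment question>", "<API call and result>", "<answer>"),
    ]
    print(build_prompt(["<E-utils doc>", "<BLAST doc>"], slim_demos, "<new question>"))
```

Keeping demonstrations as data makes it easy to compare a two-demonstration prompt against a longer one on the same benchmark.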

Agent Features

Memory

  • no persistent external memory

Planning

  • chain-of-thought subquestion decomposition

Tool Use

  • NCBI E-utils
  • BLAST URL API
  • runtime URL execution
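To make the tool list concrete, here is a small sketch of how request URLs for the two NCBI services could be built. The helper names are hypothetical, and the parameters shown (`db`, `term`, `retmax` for E-utils; `CMD`, `PROGRAM`, `DATABASE`, `QUERY` for BLAST) are a commonly used subset rather than the exact arguments used in the paper. Nothing here sends a request.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
BLAST_BASE = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def eutils_url(endpoint: str, db: str, **params: str) -> str:
    """Build an E-utils URL for esearch, efetch, or esummary."""
    if endpoint not in {"esearch", "efetch", "esummary"}:
        raise ValueError(f"unsupported E-utils endpoint: {endpoint}")
    # urlencode escapes spaces and special characters in argument values.
    query = {"db": db, **params}
    return f"{EUTILS_BASE}/{endpoint}.fcgi?{urlencode(query)}"


def blast_put_url(sequence: str, program: str = "blastn", database: str = "nt") -> str:
    """Build a BLAST URL API submission (CMD=Put) for a query sequence."""
    query = {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": sequence}
    return f"{BLAST_BASE}?{urlencode(query)}"


if __name__ == "__main__":
    print(eutils_url("esearch", "gene", term="BRCA1", retmax="5"))
    # https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=BRCA1&retmax=5
    print(blast_put_url("ACGT"))
```

Building URLs through `urlencode` rather than string concatenation avoids one source of malformed arguments when a validator or test harness needs to reproduce the calls the model is expected to emit.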

Frameworks

  • Codex (code-davinci-002) prompting

Is Agentic

true

Architectures

  • LLM + Web API tool use

Collaboration

  • single-agent tool calls

Optimization Features

Token Efficiency

  • slim prompt with two demonstrations reduces prompt size

Inference Optimization

  • stop-and-call decoding loop to execute API calls

Reproducibility

Data Urls

  • GeneTuring is public (referenced); GeneHop is introduced in the paper's appendix

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on a Codex-style model with code pretraining and long context.
  • Fails when target information is absent in NCBI databases (unanswerable with API).
  • Sensitive to argument formats and URL construction; wrong args cause errors.
  • The paper's evaluation uses exact-match scoring, which is strict, and covers a limited set of datasets.

When Not To Use

  • When the needed knowledge is not covered by the target web API or database.
  • When you cannot make external HTTP calls for privacy or regulatory reasons.
  • If you lack a code-capable LLM or sufficient context window for demonstrations.

Failure Modes

  • E1: wrong API choice or not using API (database selection errors)
  • E2: correct API but incorrect arguments (formatting/parsing errors)
  • E3: API result contains answer but model fails to extract it
  • E4: API returns no relevant entry (unanswerable via API)
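The four error categories above lend themselves to a simple tally when reviewing failed answers. The sketch below assumes each failure has already been labeled by hand with a task name and an error type; the function names and the record format are hypothetical.

```python
from collections import Counter, defaultdict
from enum import Enum
from typing import Iterable


class ErrorType(Enum):
    E1 = "wrong API choice or API not used"
    E2 = "correct API, incorrect arguments"
    E3 = "answer present in API result but not extracted"
    E4 = "API returns no relevant entry"


def tally_errors(records: Iterable[tuple[str, ErrorType]]) -> dict[str, Counter]:
    """Count error types per task from (task, error) pairs."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for task, error in records:
        counts[task][error] += 1
    return dict(counts)


def most_common_error(counts: dict[str, Counter], task: str) -> ErrorType:
    """Return the most frequent error type for one task."""
    return counts[task].most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up labels for illustration; not results from the paper.
    labeled = [
        ("task_a", ErrorType.E1),
        ("task_a", ErrorType.E1),
        ("task_a", ErrorType.E3),
        ("task_b", ErrorType.E2),
    ]
    counts = tally_errors(labeled)
    print(most_common_error(counts, "task_a").name)  # E1
```

A per-task breakdown like this shows whether effort is better spent on API selection, argument formatting, or answer extraction.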

Core Entities

Models

  • Codex (code-davinci-002)
  • GPT-3 (text-davinci-003)
  • ChatGPT
  • New Bing
  • BioGPT
  • BioMedLM

Metrics

  • Accuracy
  • recall
  • macro-average score
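As a reminder of how a macro-average differs from a pooled score, the sketch below averages per-task scores with equal weight, regardless of how many questions each task has. The exact-match normalization (trimming whitespace and lowercasing) is an assumption for illustration; the paper's per-task scoring rules may differ.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Strict string match after light normalization (assumed, not from the paper)."""
    return prediction.strip().lower() == gold.strip().lower()


def task_score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of exact matches within one task."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equal in length")
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)


def macro_average(task_scores: dict[str, float]) -> float:
    """Unweighted mean of per-task scores: every task counts equally."""
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores.values()) / len(task_scores)


if __name__ == "__main__":
    # Toy inputs for illustration only.
    scores = {
        "task_a": task_score(["x", "y"], ["x", "z"]),  # 0.5
        "task_b": task_score(["q"], ["Q"]),            # 1.0
    }
    print(macro_average(scores))  # 0.75
```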

Datasets

  • GeneTuring
  • GeneHop

Benchmarks

  • GeneTuring