Teach an LLM to call NCBI Web APIs and cut hallucinations on genomics QA

Overview

Decision Snapshot: Needs Validation

The method is straightforward to prototype using a code-oriented LLM and HTTP execution; it needs extra engineering for robust argument formatting, parsing, and API failures before production.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qiao Jin, Yifan Yang, Qingyu Chen, Zhiyong Lu

Links

Abstract / PDF / Data

Why It Matters For Business

Teaching LLMs to call domain APIs gives far more accurate, traceable answers for database-style biomedical queries than pure retrieval or base LLMs.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

GeneGPT teaches a code-oriented LLM (Codex) to call NCBI Web APIs (E-utils and BLAST) during generation. Its prompt combines API documentation with four short demonstrations, and its inference loop executes a URL whenever the model emits a special marker. With this setup, GeneGPT answers genomics questions far more accurately than vanilla LLMs and retrieval-based systems on the GeneTuring benchmark (macro-average 0.83 vs 0.44 for New Bing). A slim prompt with only two demonstrations performs at least as well as the full prompt (0.83 vs 0.80 overall), and the method generalizes to multi-hop questions via chained API calls (new GeneHop dataset). Errors cluster by task and point to practical extraction and API-argument failures.
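The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not name the special marker, so `->` and the `[URL]->[result]` layout are assumptions, and `generate` and `fetch` are hypothetical callables standing in for the LLM call and the HTTP request.

```python
import re
from typing import Callable

MARKER = "->"  # assumed marker; the summary does not name the actual token
# A bracketed URL at the very end of the text that precedes the marker.
URL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]\s*$")


def answer_with_api_calls(
    prompt: str,
    generate: Callable[[str], str],
    fetch: Callable[[str], str],
    max_calls: int = 5,
) -> str:
    """Alternate between generation and API execution.

    `generate` is assumed to return only the newly generated text and to stop
    either right after MARKER (for example via a stop sequence) or at the end
    of the answer. `fetch` is assumed to return the API response as text.
    """
    text = prompt
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        stripped = chunk.rstrip()
        if not stripped.endswith(MARKER):
            break  # no pending API call, so the answer is complete
        match = URL_PATTERN.search(stripped[: -len(MARKER)].rstrip())
        if match is None:
            break  # marker without a parsable URL; stop rather than guess
        # Append the real API result so the model continues from it.
        text += f"[{fetch(match.group(1))}]\n"
    return text[len(prompt):]
```

Passing `generate` and `fetch` in as arguments keeps the loop testable offline and lets the same code drive any model that supports stop sequences.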

Problem Statement

Autoregressive LLMs hallucinate when asked to report precise biomedical facts. Domain web APIs contain authoritative data but are hard for non-experts to use. Can we teach an LLM, via prompting and an API-aware decoding loop, to call database web APIs and use their results to answer precise genomics questions?

Main Contribution

GeneGPT: a prompting + decoding method that lets Codex generate API URLs, triggers real API calls, and ingests results to form answers.

State-of-the-art results on 8 GeneTuring genomics tasks (macro-average 0.83), outperforming retrieval-augmented and biomedical LLMs.

Key Findings

GeneGPT outperforms retrieval and domain LLMs on evaluated genomics QA.

Numbers: Macro-average 0.83 (GeneGPT) vs 0.44 (New Bing) on GeneTuring

Practical Use: Use API-augmented LLMs for database-like biomedical queries instead of plain retrieval-augmented models when authoritative entries matter.

Evidence Ref: Table 2, §3.4

A much smaller prompt (two demonstrations) achieves similar or better results than the full prompt.

Numbers: GeneGPT-slim overall 0.83 vs GeneGPT-full 0.80

Practical Use: You can get strong gains with only two concise, task-relevant API call examples; keep prompts small and focused.

Evidence Ref: Table 2, §4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GeneTuring overall average	0.83 (GeneGPT)	0.44 (New Bing)	+0.39	GeneTuring (selected 9 tasks)	Table 2 (main results)	Table 2, §3.4
GeneHop average (multi-hop)	0.50 (GeneGPT)	0.24 (New Bing)	+0.26	GeneHop (3 tasks, 50 q each)	Table 3, §4.2	Table 3
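As a quick check on the Delta column, the sketch below recomputes both deltas from the table values and shows how a macro-average is formed as the unweighted mean of per-task scores. The per-task scores in `toy_scores` are invented purely to illustrate the arithmetic and are not results from the paper; the function names are hypothetical.

```python
from statistics import mean
from typing import Mapping


def macro_average(per_task_scores: Mapping[str, float]) -> float:
    """Unweighted mean over tasks: every task counts equally."""
    return mean(per_task_scores.values())


def delta(score: float, baseline: float) -> float:
    """Score minus baseline, rounded to the two decimals used in the table."""
    return round(score - baseline, 2)


# Values taken from the results table above.
geneturing_delta = delta(0.83, 0.44)  # GeneGPT vs New Bing on GeneTuring
genehop_delta = delta(0.50, 0.24)     # GeneGPT vs New Bing on GeneHop

# Toy per-task scores (illustrative only, NOT from the paper).
toy_scores = {"task_a": 1.0, "task_b": 0.5, "task_c": 0.0}
toy_macro = macro_average(toy_scores)
```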

What To Try In 7 Days

Prototype an API-aware prompt: include 2 short demonstrations (alias and alignment) and API docs.

Use a code-oriented LLM (Codex-like) and implement an execution loop that detects a special marker and issues HTTP calls.

Run the prototype on a small domain benchmark and log error types to prioritize parsing or argument fixes.
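For the first step, a prompt of this shape could be assembled along the following lines. This is a sketch under stated assumptions: the instruction sentence, the section labels, and the `build_prompt` helper are hypothetical, and the documentation and demonstration strings are placeholders to be replaced with real API docs and worked call traces.

```python
from typing import List, Tuple


def build_prompt(
    api_docs: List[str],
    demonstrations: List[Tuple[str, str]],
    question: str,
) -> str:
    """Concatenate API documentation, worked demonstrations, and the new question.

    Each demonstration is a (question, worked trace) pair, where the trace
    shows the API calls and the final answer.
    """
    # Hypothetical instruction wording; the source does not give the exact text.
    sections = ["Answer genomics questions by calling the documented web APIs."]
    for doc in api_docs:
        sections.append("Documentation:\n" + doc.strip())
    for demo_question, demo_trace in demonstrations:
        sections.append(f"Question: {demo_question.strip()}\n{demo_trace.strip()}")
    # The prompt ends with the open question so the model continues from it.
    sections.append(f"Question: {question.strip()}\n")
    return "\n\n".join(sections)


slim_prompt = build_prompt(
    api_docs=["<E-utils documentation>", "<BLAST documentation>"],
    demonstrations=[
        ("<gene alias question>", "<alias demonstration trace>"),
        ("<sequence alignment question>", "<alignment demonstration trace>"),
    ],
    question="<new genomics question>",
)
```

Keeping the demonstrations as data makes it easy to compare a two-demonstration prompt against a longer one on the same benchmark.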

Agent Features

Memory

no persistent external memory

Planning

chain-of-thought subquestion decomposition

Tool Use

NCBI E-utils, BLAST URL API, runtime URL execution
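To make the tool interface concrete, here is a minimal sketch of how an E-utils request URL can be built. The base URL and the `esearch`, `efetch`, and `esummary` endpoints are part of NCBI's public E-utilities; the `eutils_url` helper and the example query term are illustrative assumptions rather than anything specified in the paper. The BLAST URL API is not shown.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
SUPPORTED_ENDPOINTS = {"esearch", "efetch", "esummary"}


def eutils_url(endpoint: str, **params: str) -> str:
    """Build an E-utils URL with percent-encoded query parameters."""
    if endpoint not in SUPPORTED_ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    return f"{EUTILS_BASE}/{endpoint}.fcgi?{urlencode(params)}"


# Example: search the gene database for a symbol (the term is an arbitrary example).
search_url = eutils_url("esearch", db="gene", term="BRCA1", retmax="5", retmode="json")
```

In GeneGPT the model writes such URLs itself; a helper like this is mainly useful for building demonstrations and for checking that model-written URLs are well formed.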

Frameworks

Codex (code-davinci-002) prompting

Is Agentic

Yes

Architectures

LLM + Web API tool use

Collaboration

single-agent tool calls

Optimization Features

Token Efficiency

slim prompt with two demonstrations reduces prompt size

Inference Optimization

stop-and-call decoding loop to execute API calls
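Each call issued by such a loop can fail or return a very long payload, which is part of the extra engineering flagged for production use. The sketch below is one possible hardening step, not something described in the paper: the retry count, the exponential backoff, and the truncation to `max_chars` are all assumptions, and `call_api` is a hypothetical helper.

```python
import time
from typing import Callable
from urllib.request import urlopen


def _http_get(url: str, timeout: float = 10.0) -> str:
    """Plain HTTP GET that returns the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def call_api(
    url: str,
    get: Callable[[str], str] = _http_get,
    retries: int = 3,
    backoff_seconds: float = 1.0,
    max_chars: int = 10_000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch a URL with retries and truncate the result to fit the context window."""
    last_error = None
    for attempt in range(retries):
        try:
            return get(url)[:max_chars]
        except OSError as error:  # urllib's URLError and socket timeouts are OSError subclasses
            last_error = error
            if attempt < retries - 1:
                sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    # Return a readable failure message instead of raising, so generation can continue.
    return f"API call failed: {last_error}"
```

Returning a failure string rather than raising is a design choice: it lets the model see that the call failed, at the cost of needing a separate log to count such failures.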

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

GeneTuring public (referenced), GeneHop introduced in paper appendix

Risks & Boundaries

Limitations

Relies on Codex-style model with code pretraining and long context.

Fails when target information is absent in NCBI databases (unanswerable with API).

When Not To Use

When the needed knowledge is not covered by the target web API or database.

When you cannot make external HTTP calls for privacy or regulatory reasons.

Failure Modes

E1: wrong API choice or not using API (database selection errors)

E2: correct API but incorrect arguments (formatting/parsing errors)
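When logging errors from a prototype, a rough first-pass tagger for these two failure modes might look like the sketch below. It is a heuristic under stated assumptions: `classify_call` is a hypothetical helper, the expected endpoint and required parameters must be supplied per question, and it only detects missing or empty arguments, not arguments that are present but wrong.

```python
from collections import Counter
from typing import Optional, Set
from urllib.parse import parse_qs, urlparse


def classify_call(
    url: Optional[str],
    expected_endpoint: str,
    required_params: Set[str],
) -> Optional[str]:
    """Return "E1", "E2", or None when no problem is detected.

    E1: no API call, or a call to a different endpoint than expected.
    E2: the expected endpoint, but a required argument is missing or empty.
    """
    if not url:
        return "E1"
    parsed = urlparse(url)
    if not parsed.path.endswith(expected_endpoint):
        return "E1"
    # parse_qs drops parameters with blank values by default.
    params = parse_qs(parsed.query)
    if not required_params.issubset(params):
        return "E2"
    return None


base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
logged_urls = [
    None,                                      # model answered without calling an API
    f"{base}/efetch.fcgi?db=gene&id=1",        # wrong endpoint for a search question
    f"{base}/esearch.fcgi?db=gene&term=",      # right endpoint, empty term
    f"{base}/esearch.fcgi?db=gene&term=BRCA1"  # well formed
]
tally = Counter(classify_call(u, "esearch.fcgi", {"db", "term"}) for u in logged_urls)
```

A tally like this gives a cheap signal for whether to invest first in demonstrations that steer API choice (E1) or in argument formatting (E2).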

Core Entities

Models

Codex (code-davinci-002), GPT-3 (text-davinci-003), ChatGPT, New Bing, BioGPT, BioMedLM

Metrics

Accuracy, recall, macro-average score

Datasets

GeneTuring, GeneHop

Benchmarks

GeneTuring
