Automating IEEE BioCompute Object creation from papers using RAG and LLMs

September 23, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The prototype is functional and open-source, but relies on external LLM APIs, lacks large-model testing, and needs human review for critical fields.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 0/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/2

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Sean Kim, Raja Mazumder

Why It Matters For Business

Automating BCO creation reduces the manual work of documenting legacy bioinformatics workflows and can speed up evaluation, handoff, and regulatory review, provided a human verifies the output.

Summary TLDR

This paper builds a proof-of-concept tool (BCO assistant) that uses Retrieval-Augmented Generation (RAG) plus LLMs to auto-generate IEEE BioCompute Objects (BCOs) from scientific papers and optional GitHub repos. Key engineering choices: per-domain prompts, chunk+embed+vector-store retrieval, two-pass retrieval with a cross-encoder re-ranker, optional repo ingestion, and integrated automated and human evaluation. Code and docs are open on GitHub. The tool lowers manual work for retroactive documentation but still needs human review for missing repo-level details and to catch hallucinations.

Problem Statement

Creating standard-compliant BioCompute Objects for past bioinformatics studies is time-consuming. Papers often omit workflow details that live in external code repos. Manual BCO creation is a barrier to reproducibility and adoption of the IEEE BCO standard.

Main Contribution

A working BCO assistant that ingests a paper (PDF) and optional GitHub repo to auto-generate per-domain BCO JSON.

A RAG pipeline with chunking, embeddings, top-k retrieval, and a two-pass re-ranking (cross-encoder) to improve relevance.
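The chunk, embed, and top-k retrieval steps can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: the bag-of-words `embed` function is a stand-in for a real embedding model, no vector store is used, and the names `chunk_text`, `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the k most similar."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Made-up example text, not taken from the paper.
paper = ("Reads were aligned with BWA to the reference genome. "
         "Variants were called with a custom script. "
         "Funding was provided by a research grant.")
chunks = chunk_text(paper, size=9, overlap=0)
best = top_k("Which tool aligned the reads?", chunks, k=1)
```

In a real deployment the chunk vectors would be computed once and stored, so that only the query is embedded at retrieval time.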

Key Findings

RAG plus LLMs can produce domain-specific BCO text from papers and repos.

Practical Use: Use a RAG pipeline to extract paper content and format it into BCO domains, then have a human verify outputs.

Evidence Ref: Abstract; Implementation; Tool Overview

Two-pass retrieval with a cross-encoder reranker improved output quality compared to only embedding-based retrieval.

Practical Use: Implement a cheap first-pass embedding search followed by a focused cross-encoder re-rank for better relevance.

Evidence Ref: Improvements section describing reranking and Figure 3
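The two-pass pattern can be sketched as below, assuming pluggable scoring functions. Both scorers here are toy stand-ins chosen so the example runs without any model: `overlap_score` plays the role of embedding similarity, and `bigram_score` plays the role of a cross-encoder by scoring the query and chunk jointly. A real re-ranker would be a trained model, and all names are hypothetical.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, chunk: str) -> float:
    """Cheap first-pass stand-in: fraction of query tokens found in the chunk."""
    q, c = set(tokens(query)), set(tokens(chunk))
    return len(q & c) / len(q) if q else 0.0


def bigram_score(query: str, chunk: str) -> float:
    """Toy re-ranker stand-in: counts adjacent word pairs shared by query and chunk."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        t = tokens(text)
        return set(zip(t, t[1:]))
    return float(len(bigrams(query) & bigrams(chunk)))


def two_pass_retrieve(query: str, chunks: list[str], first_pass: Scorer,
                      reranker: Scorer, k_first: int = 10, k_final: int = 3) -> list[str]:
    # Pass 1: score every chunk cheaply and keep a shortlist.
    shortlist = sorted(chunks, key=lambda c: first_pass(query, c), reverse=True)[:k_first]
    # Pass 2: run the more expensive scorer only on the shortlist.
    return sorted(shortlist, key=lambda c: reranker(query, c), reverse=True)[:k_final]


# Made-up chunks for illustration.
chunks = [
    "The genome reference was downloaded; alignment is discussed later.",
    "Reads were mapped to the reference genome using BWA-MEM.",
    "Funding was provided by a grant.",
]
query = "reference genome alignment"
ranked = two_pass_retrieve(query, chunks, overlap_score, bigram_score, k_first=2, k_final=1)
```

In this toy case the first pass alone prefers the first chunk because it shares more query tokens, while the second pass promotes the chunk that actually contains the phrase "reference genome". The expensive scorer only ever sees `k_first` candidates, which is what keeps the second pass affordable.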

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Answer relevancy (automated) | Evaluated using DeepEval; no numeric score reported in paper | - | - | Generated BCO domains | - | Improvements and Extensibility sections describing DeepEval relevancy metric
Faithfulness (automated) | Evaluated using DeepEval; no numeric score reported in paper | - | - | Generated BCO domains vs retrieved nodes | - | Improvements and Extensibility sections describing DeepEval faithfulness metric

What To Try In 7 Days

Clone the repo and run BCO assistant on one paper using default settings.

Index a paper plus its public GitHub repo to see how repo ingestion changes outputs.

Run the provided evaluation UI to compare generated vs human-curated domains for one workflow.

Optimization Features

Token Efficiency
Per-domain generation reduces effective context needs
Infra Optimization
Planned microservices architecture for scaling and experimentation
System Optimization
Two-pass retrieval with cross-encoder reranker
Split retrieval embedding and LLM prompt to avoid polluting similarity scores
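
One way to read the per-domain generation and split-prompt items above is sketched below: each BCO domain carries a short retrieval query that is used only for similarity search, and a separate instruction prompt that is sent to the LLM along with the retrieved chunks. The query and prompt wording, the `DomainSpec` class, and `build_request` are illustrative assumptions, not the tool's actual prompts or code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DomainSpec:
    name: str
    retrieval_query: str  # embedded and used only for similarity search
    llm_prompt: str       # formatting instructions, sent only to the LLM


# Illustrative entries for two BCO domains; wording is hypothetical.
DOMAINS = [
    DomainSpec(
        name="usability_domain",
        retrieval_query="purpose and scientific use case of the workflow",
        llm_prompt="Write the usability domain as a JSON array of strings.",
    ),
    DomainSpec(
        name="parametric_domain",
        retrieval_query="software parameters, flags, and thresholds",
        llm_prompt="Write the parametric domain as a JSON array of objects.",
    ),
]


def build_request(spec: DomainSpec, retrieve: Callable[[str], list[str]]) -> str:
    """Retrieve with the short query, then assemble the full LLM prompt."""
    context = retrieve(spec.retrieval_query)
    return spec.llm_prompt + "\n\nContext:\n" + "\n---\n".join(context)


# Stand-in retriever that records what it was asked.
seen_queries: list[str] = []


def fake_retrieve(query: str) -> list[str]:
    seen_queries.append(query)
    return ["chunk A", "chunk B"]


prompt = build_request(DOMAINS[1], fake_retrieve)
```

Because the formatting instructions never reach the retriever, they cannot shift the similarity scores, and each request carries context for one domain only.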

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Could not test largest frontier open-source LLMs due to compute and cost limits.

Tool depends on external libraries (e.g., LlamaIndex) which may hide low-level behavior.

When Not To Use

If the paper has no linked public code or data and precise run parameters are required.

For final regulatory submissions without human verification.

Failure Modes

LLM hallucination leading to incorrect or fabricated parameter values.

Missing parametric or file-location details when repos are not indexed.
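
As a rough aid for the human review step, generated parameter values can be triaged by checking whether each value appears verbatim in the retrieved text. The sketch below is a simple string check under that assumption; it is not the DeepEval faithfulness metric and not part of the tool, and the function name and example values are hypothetical.

```python
import re


def ungrounded_values(generated: dict[str, str], retrieved_text: str) -> list[str]:
    """Return parameter names whose values never appear verbatim in the retrieved text."""
    haystack = retrieved_text.lower()
    flagged = []
    for name, value in generated.items():
        # Match the value as a whole token so that "10" does not match "100" or "0.10".
        pattern = r"(?<![\w.])" + re.escape(str(value).lower()) + r"(?!\w|\.\d)"
        if not re.search(pattern, haystack):
            flagged.append(name)
    return flagged


# Made-up example: "threads" has no support in the retrieved text.
generated = {"min_quality": "30", "threads": "8"}
retrieved = "Reads with quality below 30 were discarded."
needs_review = ungrounded_values(generated, retrieved)
```

A check like this only narrows down where to look. A value that is phrased differently in the source is flagged even if it is correct, and a value that happens to appear elsewhere in the text passes even if it is attached to the wrong parameter.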

Core Entities

Models

OpenAI API (unspecified LLMs used via API)
Llama 3.1 (discussed but not used due to compute limits)

Metrics

answer relevancy
faithfulness