Automating IEEE BioCompute Object creation from papers using RAG and LLMs

Overview

Decision Snapshot: Needs Validation

The prototype is functional and open-source, but relies on external LLM APIs, lacks large-model testing, and needs human review for critical fields.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 0/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/2

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Sean Kim, Raja Mazumder

Links

Abstract / PDF / Code

Why It Matters For Business

Automating BCO creation cuts manual work for documenting legacy bioinformatics workflows and speeds evaluation, handoff, and regulatory review when human verification is applied.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This paper builds a proof-of-concept tool (BCO assistant) that uses Retrieval-Augmented Generation (RAG) plus LLMs to auto-generate IEEE BioCompute Objects (BCOs) from scientific papers and optional GitHub repos. Key engineering choices: per-domain prompts, chunk+embed+vector-store retrieval, two-pass retrieval with a cross-encoder re-ranker, optional repo ingestion, and integrated automated and human evaluation. Code and docs are open on GitHub. The tool lowers manual work for retroactive documentation but still needs human review for missing repo-level details and to catch hallucinations.

Problem Statement

Creating standard-compliant BioCompute Objects for past bioinformatics studies is time-consuming. Papers often omit workflow details that live in external code repos. Manual BCO creation is a barrier to reproducibility and adoption of the IEEE BCO standard.

Main Contribution

A working BCO assistant that ingests a paper (PDF) and optional GitHub repo to auto-generate per-domain BCO JSON.

A RAG pipeline with chunking, embeddings, top-k retrieval, and a two-pass re-ranking (cross-encoder) to improve relevance.
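The chunking and top-k retrieval steps named above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `chunk_text`, `top_k`, and the window sizes are hypothetical names and values chosen for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative).

    Assumes chunk_size > overlap so the window always moves forward.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In a real pipeline the chunk vectors would be computed once and kept in a vector store rather than re-embedded on every query as this sketch does.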

Key Findings

RAG plus LLMs can produce domain-specific BCO text from papers and repos.

Practical Use: Use a RAG pipeline to extract paper content and format it into BCO domains, then have a human verify outputs.

Evidence Ref: Abstract; Implementation; Tool Overview

The authors report that two-pass retrieval with a cross-encoder reranker improved output quality over embedding-only retrieval; no numeric comparison is given.

Practical Use: Implement a cheap first-pass embedding search followed by a focused cross-encoder re-rank for better relevance.

Evidence Ref: Improvements section describing reranking and Figure 3
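The two-pass idea can be illustrated with a small sketch: a cheap first pass narrows the pool, then a more expensive scorer that reads the query and passage together reorders the survivors. Everything here is an assumption for illustration. `first_pass` uses token overlap in place of embedding similarity, and `toy_pair_score` is a hypothetical stand-in for a cross-encoder model, not the reranker the paper used.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def first_pass(query, passages, k):
    """Cheap, wide recall: rank by number of distinct shared tokens.

    Stand-in for embedding similarity search over a vector store.
    """
    q = set(tokens(query))
    ranked = sorted(passages, key=lambda p: len(q & set(tokens(p))), reverse=True)
    return ranked[:k]


def toy_pair_score(query, passage):
    """Hypothetical stand-in for a cross-encoder.

    Scores the (query, passage) pair jointly by counting query bigrams
    that appear in the same order in the passage.
    """
    q = tokens(query)
    p = tokens(passage)
    passage_bigrams = set(zip(p, p[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in passage_bigrams)


def two_pass_retrieve(query, passages, first_k=10, final_k=3,
                      pair_score=toy_pair_score):
    """First pass selects first_k candidates; second pass re-ranks them."""
    candidates = first_pass(query, passages, first_k)
    reranked = sorted(candidates, key=lambda p: pair_score(query, p), reverse=True)
    return reranked[:final_k]
```

The design point is cost: the pairwise scorer runs only on the `first_k` candidates, so a slower but more precise model can be afforded in the second pass.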

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Answer relevancy (automated)	Evaluated using DeepEval; no numeric score reported in paper	—	—	Generated BCO domains	Improvements and Extensibility sections describing DeepEval relevancy metric	—
Faithfulness (automated)	Evaluated using DeepEval; no numeric score reported in paper	—	—	Generated BCO domains vs retrieved nodes	Improvements and Extensibility sections describing DeepEval faithfulness metric	—
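For intuition about what a faithfulness score measures, the sketch below treats it as the fraction of generated claims that the retrieved context supports. This ratio framing is an assumption about how such metrics are typically defined. DeepEval relies on an LLM judge, whereas the `is_supported` check here is a naive token-overlap stand-in with a hypothetical threshold, so the numbers it produces are not comparable to DeepEval output.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim, contexts, threshold=0.8):
    """Naive judge (assumption): a claim counts as supported when at least
    `threshold` of its tokens appear in a single retrieved context."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return False
    return any(
        len(claim_tokens & _tokens(context)) / len(claim_tokens) >= threshold
        for context in contexts
    )


def faithfulness(claims, contexts):
    """Fraction of generated claims supported by the retrieved contexts."""
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, contexts) for claim in claims)
    return supported / len(claims)
```

Answer relevancy can be framed the same way, with the judge asking whether each generated statement addresses the prompt instead of whether the context supports it.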

What To Try In 7 Days

Clone the repo and run BCO assistant on one paper using default settings.

Index a paper plus its public GitHub repo to see how repo ingestion changes outputs.

Run the provided evaluation UI to compare generated vs human-curated domains for one workflow.

Optimization Features

Token Efficiency

Per-domain generation reduces effective context needs
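A sketch of per-domain generation, assuming one retrieval call and one LLM call per domain. The domain names follow the IEEE 2791-2020 schema; which of them the tool actually generates is not stated here. `build_prompt`, `generate_bco`, and the injected `retrieve` and `generate` callables are hypothetical names for illustration.

```python
# Domain names from the IEEE 2791-2020 BioCompute Object schema.
BCO_DOMAINS = [
    "provenance_domain",
    "usability_domain",
    "extension_domain",
    "description_domain",
    "execution_domain",
    "parametric_domain",
    "io_domain",
    "error_domain",
]


def build_prompt(domain, retrieved_chunks):
    """Assemble a prompt that carries only the chunks retrieved for one domain."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        f"Fill in the BioCompute Object field '{domain}' as JSON, "
        f"using only the context below.\n\nContext:\n{context}"
    )


def generate_bco(retrieve, generate, domains=BCO_DOMAINS):
    """Generate each domain independently and collect the results.

    retrieve(domain) -> list of text chunks relevant to that domain
    generate(prompt) -> parsed domain content (for example, a dict)
    """
    bco = {}
    for domain in domains:
        chunks = retrieve(domain)
        bco[domain] = generate(build_prompt(domain, chunks))
    return bco
```

Because each prompt holds only the context for a single domain, no one call needs to fit the whole paper, which is the sense in which per-domain generation reduces context needs.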

Infra Optimization

Planned microservices architecture for scaling and experimentation

System Optimization

Two-pass retrieval with cross-encoder reranker

Split retrieval embedding and LLM prompt to avoid polluting similarity scores
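One way to keep the retrieval text separate from the LLM prompt is to store them as two fields and pass only the first to the search step. The sketch below assumes that structure; `DomainSpec`, `run_domain`, and the injected `search` and `llm` callables are hypothetical, not names from the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainSpec:
    name: str
    retrieval_query: str   # short, content-focused; the only text that is embedded
    llm_instructions: str  # formatting and schema rules; sent to the LLM only


def run_domain(spec, search, llm):
    """Retrieve with the short query, then prompt with instructions plus context.

    search(query) -> list of text chunks
    llm(prompt)   -> generated text
    """
    context = search(spec.retrieval_query)
    prompt = f"{spec.llm_instructions}\n\nContext:\n" + "\n---\n".join(context)
    return llm(prompt)
```

If the formatting instructions were embedded along with the query, words such as "JSON" or "array" would pull the similarity search toward passages that happen to mention them rather than passages about the workflow.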

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/biocomputeobjects/bco-rag

Risks & Boundaries

Limitations

The authors could not test the largest open-source LLMs due to compute and cost limits.

The tool depends on external libraries (e.g., LlamaIndex), which may hide low-level behavior.

When Not To Use

If the paper has no linked public code or data and precise run parameters are required.

For final regulatory submissions without human verification.

Failure Modes

LLM hallucination leading to incorrect or fabricated parameter values.

Missing parametric or file-location details when repos are not indexed.
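As an aid to human review, generated parameter values can be checked against the source text so that values appearing nowhere in the paper or repo are flagged first. The sketch below is one hypothetical way to do that and is not part of the tool. It assumes entries shaped like the `param`, `value`, and `step` fields of a BCO parametric domain and uses plain substring matching.

```python
def unverified_values(parameters, source_text):
    """Return parameter entries whose value never appears verbatim in the source.

    parameters: list of dicts with at least a "value" key
    source_text: concatenated text of the paper and any indexed repo files
    """
    haystack = source_text.lower()
    return [p for p in parameters if str(p["value"]).lower() not in haystack]


def review_queue(parameters, source_text):
    """Order entries so unverified ones are reviewed first."""
    flagged = unverified_values(parameters, source_text)
    verified = [p for p in parameters if p not in flagged]
    return flagged + verified
```

Substring matching is deliberately crude: a value such as "14" also matches inside "2014", so a clean result does not prove that a parameter is correct. It only prioritizes what a reviewer should look at first.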

Core Entities

Models

OpenAI API (unspecified LLMs used via API)

Llama 3.1 (discussed but not used due to compute limits)

Metrics

answer relevancy

faithfulness

