Automating IEEE BioCompute Object creation from papers using RAG and LLMs

September 23, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Sean Kim, Raja Mazumder

Why It Matters For Business

Automating BCO creation cuts the manual work of documenting legacy bioinformatics workflows and speeds evaluation, handoff, and regulatory review, provided a human verifies the output.

Summary TLDR

This paper builds a proof-of-concept tool (BCO assistant) that uses Retrieval-Augmented Generation (RAG) plus LLMs to auto-generate IEEE BioCompute Objects (BCOs) from scientific papers and optional GitHub repos. Key engineering choices: per-domain prompts, chunk+embed+vector-store retrieval, two-pass retrieval with a cross-encoder re-ranker, optional repo ingestion, and integrated automated and human evaluation. Code and docs are open on GitHub. The tool lowers manual work for retroactive documentation but still needs human review for missing repo-level details and to catch hallucinations.

Problem Statement

Creating standard-compliant BioCompute Objects for past bioinformatics studies is time-consuming. Papers often omit workflow details that live in external code repos. Manual BCO creation is a barrier to reproducibility and adoption of the IEEE BCO standard.

Main Contribution

A working BCO assistant that ingests a paper (PDF) and optional GitHub repo to auto-generate per-domain BCO JSON.

A RAG pipeline with chunking, embeddings, top-k retrieval, and a second-pass cross-encoder re-ranker to improve relevance.

Standardized, per-domain prompts and a strategy that keeps the retrieval query separate from the LLM prompt, to reduce hallucination and improve schema conformity.

Optional GitHub ingestion to capture parametric and description details missing from papers.

Integrated evaluation stack: automated metrics via DeepEval plus a human evaluation UI and parameter-search wrappers for testing.
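
The chunking, top-k retrieval, and re-ranking steps listed above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the bag-of-words `embed` function stands in for a real embedding model, `pair_score` stands in for a cross-encoder, and all function names and default values are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk(text, size=12, overlap=4):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def first_pass(query, chunks, top_k):
    """Pass 1: rank chunks by similarity of independently computed vectors."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def pair_score(query, passage):
    """Stand-in for a cross-encoder: scores the (query, passage) pair jointly.
    Here it is simply the fraction of query terms found in the passage."""
    q_terms = set(tokenize(query))
    p_terms = set(tokenize(passage))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def two_pass_retrieve(query, chunks, top_k=4, top_n=2):
    """Pass 2: re-rank the first-pass candidates and keep the best top_n."""
    candidates = first_pass(query, chunks, top_k)
    reranked = sorted(candidates, key=lambda c: pair_score(query, c), reverse=True)
    return reranked[:top_n]


passages = [
    "The pipeline aligns reads with a short read aligner.",
    "Samples were collected from three hospital sites.",
    "Variant calling parameters: minimum quality 30 and minimum depth 10.",
    "Funding was provided by a national agency.",
]
best = two_pass_retrieve("variant calling parameters minimum quality", passages)
```

The first pass is cheap because each chunk vector can be computed once and stored; the second pass is more expensive per candidate, which is why it only sees the short list the first pass returns.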

Key Findings

RAG plus LLMs can produce domain-specific BCO text from papers and repos.

Two-pass retrieval with a cross-encoder reranker improved output quality compared with embedding-only retrieval.

Important run parameters and detailed pipeline steps are often only present in linked GitHub repositories, not in the paper.

Results

Answer relevancy (automated)

Value: Evaluated using DeepEval; no numeric score reported in the paper

Faithfulness (automated)

Value: Evaluated using DeepEval; no numeric score reported in the paper

What To Try In 7 Days

Clone the repo and run BCO assistant on one paper using default settings.

Index a paper plus its public GitHub repo to see how repo ingestion changes outputs.

Run the provided evaluation UI to compare generated vs human-curated domains for one workflow.

Optimization Features

Token Efficiency

  • Per-domain generation reduces effective context needs
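
One way to picture per-domain generation is a loop that drafts each BCO domain in its own call, with only that domain's retrieved context in the prompt. The sketch below is an assumption-laden illustration: the domain keys are those named in IEEE 2791-2020, but the choice of domains, the prompt wording, the `max_context_chars` cap, and the stub retriever and generator are all hypothetical.

```python
from typing import Callable, Dict, List

# Domain keys as named in the IEEE 2791-2020 schema; which ones to
# generate is an illustrative choice, not taken from the paper.
DOMAINS = [
    "usability_domain",
    "description_domain",
    "execution_domain",
    "parametric_domain",
    "io_domain",
    "error_domain",
]


def generate_per_domain(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    domains: List[str] = DOMAINS,
    max_context_chars: int = 2000,
) -> Dict[str, str]:
    """Draft each domain in a separate call with its own bounded context."""
    drafts = {}
    for domain in domains:
        context = "\n".join(retrieve(domain))[:max_context_chars]
        prompt = (
            f"Using only the context below, draft the {domain} "
            f"of a BioCompute Object.\n\nContext:\n{context}"
        )
        drafts[domain] = generate(prompt)
    return drafts


# Stub retriever and stub LLM so the sketch runs without external services.
def fake_retrieve(domain: str) -> List[str]:
    return [f"passage about {domain}"] * 3


prompts_seen: List[str] = []


def fake_generate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"draft ({len(prompt)} prompt chars)"


drafts = generate_per_domain(fake_retrieve, fake_generate, max_context_chars=40)
```

Because every prompt carries only one domain's context, its size is bounded by the cap rather than by the length of the whole paper.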

Infra Optimization

  • Planned microservices architecture for scaling and experimentation

System Optimization

  • Two-pass retrieval with cross-encoder reranker
  • Split retrieval embedding and LLM prompt to avoid polluting similarity scores
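
The effect of keeping the retrieval query separate from the LLM prompt can be shown with a toy similarity measure. In this sketch the bag-of-words vectors stand in for real embeddings, and the query, instruction text, and passage are hypothetical strings, not the tool's actual prompts. Embedding the full instruction text adds many unrelated terms, which lowers the similarity to a passage that a short, focused query matches well.

```python
import math
import re
from collections import Counter


def bow(text):
    """Toy bag-of-words vector, standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical strings for illustration only.
RETRIEVAL_QUERY = "parameters thresholds settings"
INSTRUCTIONS = (
    "You are an assistant that drafts BioCompute Object domains. "
    "Return valid JSON and cite nothing that is not in the context. "
    "Describe parameters thresholds settings"
)
passage = (
    "Non-default settings: quality thresholds of 30 and "
    "alignment parameters tuned per sample."
)

# Similarity when only the short query is embedded for retrieval.
focused = cosine(bow(RETRIEVAL_QUERY), bow(passage))
# Similarity when the whole instruction text is embedded instead.
diluted = cosine(bow(INSTRUCTIONS), bow(passage))


def build_llm_prompt(instructions, retrieved):
    """The instructions go to the LLM only; they are never embedded."""
    return instructions + "\n\nContext:\n" + "\n".join(retrieved)


prompt = build_llm_prompt(INSTRUCTIONS, [passage])
```

Here `focused` comes out higher than `diluted`, which is the behavior the split is meant to protect.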

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Could not test largest frontier open-source LLMs due to compute and cost limits.
  • Tool depends on external libraries (e.g., LlamaIndex) which may hide low-level behavior.
  • Papers often omit parametric and file-location details; a public GitHub repository is required to fill those gaps.
  • Automated evaluation is imperfect; human review remains necessary for accuracy and regulatory use.
  • Forcing strict JSON output can harm LLM reasoning and content quality.

When Not To Use

  • If the paper has no linked public code or data and precise run parameters are required.
  • For final regulatory submissions without human verification.
  • When the GitHub repository is private or inaccessible.

Failure Modes

  • LLM hallucination leading to incorrect or fabricated parameter values.
  • Missing parametric or file-location details when repos are not indexed.
  • JSON formatting or schema validation errors from constrained outputs.
  • Retrieval misses relevant content buried in long contexts.
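
Two of these failure modes lend themselves to cheap automated checks before human review. The sketch below is illustrative and not part of the tool: `parse_domain` checks that a generated domain parses as a JSON object with a set of required keys (the keys used in the example are hypothetical), and `ungrounded_numbers` flags numeric values in the generated text that never appear in the retrieved source chunks. Neither check proves correctness; they only surface candidates for a reviewer.

```python
import json
import re

NUMBER = r"\d+(?:\.\d+)?"


def parse_domain(raw, required_keys):
    """Return (data, problems); data is None if the text is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    missing = [key for key in required_keys if key not in data]
    return data, [f"missing key: {key}" for key in missing]


def ungrounded_numbers(generated_text, source_chunks):
    """List numbers in the generated text that occur in no source chunk."""
    seen = set()
    for source in source_chunks:
        seen.update(re.findall(NUMBER, source))
    return [n for n in re.findall(NUMBER, generated_text) if n not in seen]


# Hypothetical required keys and texts, for illustration only.
data, problems = parse_domain('{"step": "align"}', ["step", "param"])
_, parse_errors = parse_domain('{"step": ', ["step"])
suspect = ungrounded_numbers(
    "min quality 30, depth 12",
    ["minimum quality 30 and minimum depth 10"],
)
```

A string-level number check like this will miss reformatted values (for example `0.50` versus `0.5`), so it is a first filter rather than a faithfulness metric.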

Core Entities

Models

  • OpenAI API (unspecified LLMs used via API)
  • Llama 3.1 (discussed but not used due to compute limits)

Metrics

  • answer relevancy
  • faithfulness