Harmonia: an LLM-driven agent that interactively builds reproducible data harmonization pipelines

February 10, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Why It Matters For Business

Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.

Summary TLDR

This paper proposes "agentic" data harmonization: LLM-based agents that interact with users and call modular data-integration routines to synthesize reusable harmonization pipelines. The authors implement Harmonia, a prototype that combines a library of primitives (bdi-kit), an LLM agent (GPT-4o via Archytas), and a chat-style UI (Beaker). In a clinical use case that maps a cohort to the GDC standard, Harmonia outperforms the baseline primitives (schema-matching accuracy 1.00 vs 0.88; value-mapping F1 0.68 vs 0.57). The paper also discusses practical limits: LLM brittleness, context-window constraints, gaps in evaluation, and the need for provenance tracking, uncertainty estimates, and better benchmarks.

Problem Statement

Combining tables from different sources requires mapping column names and standardizing values. Current practice relies on scripts and ad-hoc tools that are slow, brittle, and poorly documented. LLMs can help with language and code, but they are inconsistent, sensitive to prompts, and do not by themselves provide scalable, reproducible harmonization pipelines.

Main Contribution

Present a vision for agentic data harmonization that combines LLM reasoning, user interaction, and composable primitives.

Introduce Harmonia, a working prototype that integrates bdi-kit primitives, Archytas-based LLM tool-calling, and a Beaker chat UI.

Demonstrate a clinical use case mapping a real cohort to the GDC schema and produce a reusable declarative mapping.

Discuss open problems: evaluation benchmarks, uncertainty/explanations, robustness, provenance, and UI design.

Key Findings

Harmonia achieved perfect schema-matching accuracy on the evaluated use case.

Numbers: Schema accuracy Harmonia=1.00 vs Baseline=0.88

The LLM-augmented pipeline improved value-mapping F1 over the baseline primitives.

Numbers: Value mapping F1 Harmonia=0.68 vs Baseline=0.57

LLMs are helpful but inconsistent: they sometimes fail to correct errors even with identical prompts.

Numbers: Qualitative observation; inconsistency observed across repeated runs

Results

Accuracy (schema matching)

  • Harmonia: 1.00
  • Baseline: 0.88

Value mapping F1

  • Harmonia: 0.68
  • Baseline: 0.57
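
For readers who want to compute comparable numbers on their own data, here is a minimal sketch of the two metric types reported above. It assumes schema-matching accuracy is the fraction of source columns mapped to the correct target, and that value-mapping F1 is computed over sets of (source value, target value) pairs. The summary does not state the paper's exact definitions, so both are assumptions, and all names and sample data below are hypothetical.

```python
def schema_accuracy(predicted, gold):
    """Fraction of gold source columns whose predicted target equals the gold target."""
    if not gold:
        return 0.0
    correct = sum(1 for col, target in gold.items() if predicted.get(col) == target)
    return correct / len(gold)


def precision_recall_f1(predicted_pairs, gold_pairs):
    """Set-based precision, recall, and F1 over (source_value, target_value) pairs."""
    predicted_pairs, gold_pairs = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data, not taken from the paper.
gold_columns = {
    "Gender": "gender",
    "Ethnicity": "ethnicity",
    "Age": "age_at_diagnosis",
    "Site": "tissue_or_organ_of_origin",
}
predicted_columns = dict(gold_columns, Site="site_of_resection_or_biopsy")

gold_values = [("F", "female"), ("M", "male"), ("Unk", "unknown")]
predicted_values = [("F", "female"), ("M", "male"),
                    ("Unk", "not reported"), ("N/A", "unknown")]

acc = schema_accuracy(predicted_columns, gold_columns)
precision, recall, f1 = precision_recall_f1(predicted_values, gold_values)
```

Treating value mappings as a set of pairs penalizes both wrong and missing mappings, which is why F1 can stay well below 1.0 even when most columns are matched correctly.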

Who Should Care

What To Try In 7 Days

Clone the Harmonia repo and run the demo mapping a CSV to the GDC schema.

Replace one manual mapping script with a harmonization plan and materialize the mapping to save reproducible output.

Measure time and error rate vs your current manual harmonization process.
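
To make the second step concrete, the sketch below shows one way a declarative mapping could be stored as data and materialized against a CSV, using only the Python standard library. The spec format, the `materialize` function, and the sample columns are hypothetical illustrations; they are not bdi-kit's or Harmonia's actual format or API.

```python
import csv
import io
import json

# Hypothetical declarative spec: one rule per source column.
# "values" is an optional dictionary for standardizing categorical values.
MAPPING_SPEC = [
    {"source": "Gender", "target": "gender",
     "values": {"F": "female", "M": "male"}},
    {"source": "Age", "target": "age_at_diagnosis", "values": None},
]


def materialize(rows, spec, default=None):
    """Apply the mapping spec to each row and return the harmonized rows."""
    harmonized = []
    for row in rows:
        new_row = {}
        for rule in spec:
            raw = row.get(rule["source"])
            values = rule.get("values")
            # Copy the value as-is, or translate it through the value dictionary.
            new_row[rule["target"]] = raw if values is None else values.get(raw, default)
        harmonized.append(new_row)
    return harmonized


# Hypothetical source table (all CSV fields are read as strings).
SOURCE_CSV = "Gender,Age\nF,61\nM,58\nX,70\n"
rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))
harmonized = materialize(rows, MAPPING_SPEC)

# The spec is plain data, so it can be saved and re-applied later.
spec_json = json.dumps(MAPPING_SPEC, indent=2)
```

Unmapped values such as `X` fall back to `default`, which makes them easy to count when measuring the error rate against a manual process.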

Agent Features

Memory

  • Provenance DB (history of actions and interactions)
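
As a rough illustration of what a provenance store might look like, the sketch below logs actions and interactions to SQLite and retrieves only the most recent events, which is one way to keep history outside the LLM context window. The table schema and function names are assumptions made for this example; the summary does not describe Harmonia's actual provenance design.

```python
import json
import sqlite3
import time


def open_provenance(path=":memory:"):
    """Open (or create) a provenance store with a single events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts REAL NOT NULL,
               actor TEXT NOT NULL,
               action TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    return conn


def log_event(conn, actor, action, payload):
    """Append one event; the payload is stored as JSON text."""
    conn.execute(
        "INSERT INTO events (ts, actor, action, payload) VALUES (?, ?, ?, ?)",
        (time.time(), actor, action, json.dumps(payload)),
    )
    conn.commit()


def recent_events(conn, limit=5):
    """Return the last `limit` events in chronological order."""
    rows = conn.execute(
        "SELECT actor, action, payload FROM events ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [(actor, action, json.loads(payload))
            for actor, action, payload in reversed(rows)]


# Hypothetical interaction history.
conn = open_provenance()
log_event(conn, "user", "request", {"text": "map Gender to GDC"})
log_event(conn, "agent", "tool_call", {"tool": "match_schema"})
log_event(conn, "agent", "tool_result", {"Gender": "gender"})
last_two = recent_events(conn, limit=2)
```

Only `last_two` would need to be placed in a prompt, while the full history stays queryable for audits.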

Planning

  • tool calling
  • task decomposition
  • dynamic pipeline synthesis

Tool Use

  • bdi-kit primitives
  • on-demand Python code generation
  • Archytas tool-calling

Frameworks

  • ReAct
  • Archytas
  • Beaker

Is Agentic

true

Architectures

  • LLM-driven agent loop
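
The general shape of a tool-calling agent loop can be sketched as follows. This is a toy under stated assumptions: the "policy" is a scripted function standing in for the LLM, and `match_schema` is a trivial case-insensitive matcher standing in for a real primitive. None of these names reflect the actual Harmonia, Archytas, or bdi-kit APIs.

```python
def match_schema(columns, targets):
    """Toy primitive: match source columns to targets by case-insensitive equality."""
    lookup = {t.lower(): t for t in targets}
    return {c: lookup.get(c.lower()) for c in columns}


# Registry of callable tools, keyed by the name the policy refers to.
TOOLS = {"match_schema": match_schema}


def scripted_policy(state):
    """Stand-in for an LLM: decide the next action from the current state."""
    if "match_schema" not in state["observations"]:
        return "match_schema", {"columns": state["columns"],
                                "targets": state["targets"]}
    return "finish", {}


def run_agent(policy, state, tools, max_steps=5):
    """Ask the policy for an action, run the tool, record the observation, repeat."""
    trace = []
    for _ in range(max_steps):
        action, arguments = policy(state)
        if action == "finish":
            break
        observation = tools[action](**arguments)
        state["observations"][action] = observation
        trace.append((action, observation))
    return state, trace


# Hypothetical task: map three source columns onto two target attributes.
state = {"columns": ["Gender", "AGE", "Site"],
         "targets": ["gender", "age"],
         "observations": {}}
final_state, trace = run_agent(scripted_policy, state, TOOLS)
```

The `max_steps` cap is a simple guard against a policy that never decides to finish. The `trace` list is the kind of record a provenance store could persist.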

Collaboration

  • two-way user-agent interaction

Optimization Features

Token Efficiency

  • objective to minimize LLM calls (discussed)

Infra Optimization

  • use of external Provenance DB to avoid context window limits

System Optimization

  • balance objectives: correctness, user interactions, compute cost

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • LLM brittleness and inconsistent corrections across runs
  • Context-window problems for large tables and long workflows
  • Lack of end-to-end benchmarks to evaluate agentic harmonization
  • Current UI is chat-based only; richer visual tools are needed

When Not To Use

  • When you require fully automated, unattended harmonization for critical systems without human review
  • On extremely large schemas without external storage to manage context
  • When strict certification or audit trails are legally required, unless provenance tracking is integrated

Failure Modes

  • LLM hallucinations leading to incorrect mappings
  • Prompt sensitivity producing different pipelines for the same task
  • Loss of earlier context causing wrong decisions in long workflows
  • Over-reliance on LLM evaluation that reflects model bias
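
One simple way to surface the prompt-sensitivity failure mode is to repeat the same task several times and flag columns where the runs disagree. The sketch below uses plain majority voting. This is a generic technique chosen for illustration, not a method the summary attributes to the paper, and the run data is hypothetical.

```python
from collections import Counter


def agreement_report(runs):
    """For each source column, return (majority_target, fraction_of_runs_agreeing)."""
    report = {}
    columns = sorted({col for run in runs for col in run})
    for col in columns:
        votes = Counter(run.get(col) for run in runs)
        target, count = votes.most_common(1)[0]
        report[col] = (target, count / len(runs))
    return report


def needs_review(report, threshold=1.0):
    """Columns whose agreement falls below the threshold should go to a human."""
    return [col for col, (_, agreement) in report.items() if agreement < threshold]


# Hypothetical column mappings from three repeated runs of the same prompt.
runs = [
    {"Gender": "gender", "Site": "tissue_or_organ_of_origin"},
    {"Gender": "gender", "Site": "tissue_or_organ_of_origin"},
    {"Gender": "gender", "Site": "site_of_resection_or_biopsy"},
]
report = agreement_report(runs)
flagged = needs_review(report)
```

Agreement across runs is not evidence of correctness, since a model can be consistently wrong, so flagged columns are a lower bound on what deserves review.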

Core Entities

Models

  • GPT-4o

Metrics

  • accuracy
  • precision
  • recall
  • F1

Datasets

  • Dou et al. cohort (used in demo)
  • GDC (Genomic Data Commons) standard

Context Entities

Models

  • small LLMs for primitives (mentioned)
  • LLM ensembles (discussion)

Metrics

  • similarity scores exposed by primitives
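
To illustrate what a similarity score from a matching primitive might look like, the sketch below ranks candidate target attributes for a source column using string similarity from Python's `difflib`. This is a stand-in chosen for simplicity. The summary does not say how bdi-kit computes its scores, and the function names and sample attributes are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two column names after light normalization."""
    def normalize(name):
        return name.lower().replace("_", " ").strip()
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def top_k_matches(column, targets, k=3):
    """Return the k best (target, score) candidates, highest score first."""
    scored = [(target, name_similarity(column, target)) for target in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical target attributes.
candidates = top_k_matches("Tumor_Site", ["gender", "tumor site", "age"], k=2)
```

Exposing the ranked candidates with scores, rather than a single answer, gives a user or an agent something to inspect when the top score is low or close to the runner-up.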

Datasets

  • various cancer cohort tables cited

Benchmarks

  • task-level benchmarks cited (schema matching, entity linking)