Harmonia: an LLM-driven agent that interactively builds reproducible data harmonization pipelines

February 10, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The prototype shows clear gains in a biomedical use case and provides reproducible plans, but robustness, scaling, and consistent automation require more work and user oversight.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/2

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.

Summary TLDR

This paper proposes "agentic" data harmonization: LLM-based agents that interact with users and modular data-integration routines to synthesize reusable harmonization pipelines. The authors implement Harmonia, a prototype that combines a library of primitives (bdi-kit), an LLM agent (GPT-4o via Archytas), and a chat-style UI (Beaker). In a clinical use case mapping a cohort to the GDC standard, Harmonia outperforms baseline primitives (schema-matching accuracy 1.0 vs 0.88; value-mapping F1 0.68 vs 0.57). The paper discusses practical limits: LLM brittleness, context-window and evaluation gaps, and the need for provenance, uncertainty, and better benchmarks.

Problem Statement

Combining tables from different sources requires mapping column names and standardizing values. Existing work uses scripts and ad-hoc tools that are slow, brittle, and poorly documented. LLMs can help with language and code, but they are inconsistent, sensitive to prompts, and do not by themselves provide scalable, reproducible harmonization pipelines.
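The two sub-tasks named above, matching columns and standardizing values, can be sketched with plain dictionaries. This is a minimal illustration only: the column names, values, and the `harmonize` helper are hypothetical and are not taken from the paper or from bdi-kit.

```python
# Hypothetical source rows; names and values are illustrative only.
source_rows = [
    {"Gender": "F", "Tumor_Stage": "Stage I", "Notes": "n/a"},
    {"Gender": "M", "Tumor_Stage": "Stage II", "Notes": "n/a"},
]

# Schema matching output: source column -> target column (assumed names).
column_mapping = {"Gender": "gender", "Tumor_Stage": "tumor_stage"}

# Value mapping output: target column -> {source value -> target value}.
value_mapping = {
    "gender": {"F": "female", "M": "male"},
    "tumor_stage": {"Stage I": "stage i", "Stage II": "stage ii"},
}


def harmonize(rows, columns, values):
    """Rename matched columns, then standardize their values.

    Columns without a match are dropped; values without a mapping
    pass through unchanged.
    """
    out = []
    for row in rows:
        new_row = {}
        for src_col, value in row.items():
            if src_col not in columns:
                continue  # no target column for this source column
            tgt_col = columns[src_col]
            new_row[tgt_col] = values.get(tgt_col, {}).get(value, value)
        out.append(new_row)
    return out


harmonized = harmonize(source_rows, column_mapping, value_mapping)
print(harmonized)
```

Keeping the two mappings as data rather than burying them in ad-hoc script logic is what makes a plan like this reusable and easy to review.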

Main Contribution

Present a vision for agentic data harmonization that combines LLM reasoning, user interaction, and composable primitives.

Introduce Harmonia, a working prototype that integrates bdi-kit primitives, Archytas-based LLM tool-calling, and a Beaker chat UI.

Key Findings

Harmonia produced perfect schema-matching on the evaluated use case.

Numbers: Schema accuracy Harmonia=1.00 vs Baseline=0.88

Practical Use: Use an agent to orchestrate algorithms plus LLM checks to improve schema matching accuracy on similar biomedical tasks.

Evidence Ref: Table 1 (Schema Matching)

LLM-augmented pipeline improved value-mapping F1 over the baseline primitives.

Numbers: Value mapping F1 Harmonia=0.68 vs Baseline=0.57

Practical Use: Leverage LLMs as evaluators and correctors on top of automated matchers to raise value-mapping quality.

Evidence Ref: Table 1 (Value Mapping)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 1.00 | 0.88 | +0.12 | Dou et al. cohort → GDC mapping | Table 1 reports Harmonia accuracy 1 and baseline 0.88 | Table 1
Value mapping F1 | 0.68 | 0.57 | +0.11 | Dou et al. cohort → GDC mapping | Table 1 reports Harmonia F1 0.68 and baseline F1 0.57 | Table 1
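This summary does not spell out how the two metrics were computed, so the sketch below uses the standard definitions of accuracy and F1 on toy data. The function names and all sample mappings are hypothetical. The last lines only recheck the arithmetic of the Delta column.

```python
def accuracy(predicted, gold):
    """Fraction of gold source columns whose predicted target is correct."""
    correct = sum(1 for col, tgt in gold.items() if predicted.get(col) == tgt)
    return correct / len(gold)


def f1_score(predicted_pairs, gold_pairs):
    """Harmonic mean of precision and recall over (source, target) pairs."""
    predicted_pairs, gold_pairs = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy schema-matching example: 3 of 4 columns matched correctly.
gold_columns = {"a": "x", "b": "y", "c": "z", "d": "w"}
predicted_columns = {"a": "x", "b": "y", "c": "z", "d": "q"}
schema_acc = accuracy(predicted_columns, gold_columns)

# Toy value-mapping example: 2 of 3 predicted pairs are correct.
gold_values = {("F", "female"), ("M", "male"), ("U", "not reported")}
predicted_values = {("F", "female"), ("M", "male"), ("U", "unknown")}
value_f1 = f1_score(predicted_values, gold_values)

# Recheck the Delta column using the reported values from Table 1.
deltas = {
    "accuracy": round(1.00 - 0.88, 2),
    "value_mapping_f1": round(0.68 - 0.57, 2),
}
print(schema_acc, round(value_f1, 3), deltas)
```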

What To Try In 7 Days

Clone the Harmonia repo and run the demo mapping a CSV to the GDC schema.

Replace one manual mapping script with a harmonization plan and materialize the mapping to save reproducible output.

Measure time and error rate vs your current manual harmonization process.
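For the measurement step, a small harness along the following lines may help. It is a sketch under stated assumptions: `evaluate` and `manual_script` are hypothetical names, the rows are made up, and the harness only captures script execution time and a row-level error rate against a hand-checked sample. Human effort spent writing and reviewing mappings has to be tracked separately.

```python
import time


def evaluate(harmonize_fn, rows, expected):
    """Time one harmonization run and count rows that differ from expected."""
    start = time.perf_counter()
    produced = harmonize_fn(rows)
    elapsed = time.perf_counter() - start
    errors = sum(1 for got, want in zip(produced, expected) if got != want)
    errors += abs(len(produced) - len(expected))  # missing or extra rows
    return {"seconds": elapsed, "error_rate": errors / max(len(expected), 1)}


def manual_script(rows):
    """Stand-in for an existing hand-written mapping script (hypothetical)."""
    lookup = {"F": "female", "M": "male"}
    return [{"gender": lookup.get(r["gender"], r["gender"])} for r in rows]


sample_rows = [{"gender": "F"}, {"gender": "M"}, {"gender": "U"}]
hand_checked = [{"gender": "female"}, {"gender": "male"}, {"gender": "unknown"}]

result = evaluate(manual_script, sample_rows, hand_checked)
print(result)
```

Running the same harness on the output of a materialized harmonization plan gives a like-for-like comparison on the same hand-checked sample.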

Agent Features

Memory
Provenance DB (history of actions and interactions)
Planning
tool calling, task decomposition, dynamic pipeline synthesis
Tool Use
bdi-kit primitives, on-demand Python code generation, Archytas tool-calling
Frameworks
ReAct, Archytas, Beaker
Is Agentic

Yes

Architectures
LLM-driven agent loop
Collaboration
two-way user-agent interaction

Optimization Features

Token Efficiency
objective to minimize LLM calls (discussed)
Infra Optimization
use of external Provenance DB to avoid context window limits
System Optimization
balance objectives: correctness, user interactions, compute cost
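The idea of keeping history in an external provenance store, so that only a small slice needs to enter the LLM context, can be sketched with Python's built-in `sqlite3` module. The table layout and the `record` and `recent` helpers are assumptions made for illustration; this summary does not describe Harmonia's actual provenance schema.

```python
import json
import sqlite3

# In-memory database for the sketch; a file path would persist the history.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE provenance ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "actor TEXT NOT NULL, "
    "action TEXT NOT NULL, "
    "payload TEXT NOT NULL)"
)


def record(actor, action, payload):
    """Append one user or agent action to the provenance log."""
    conn.execute(
        "INSERT INTO provenance (actor, action, payload) VALUES (?, ?, ?)",
        (actor, action, json.dumps(payload)),
    )
    conn.commit()


def recent(limit):
    """Return the last `limit` actions in chronological order."""
    rows = conn.execute(
        "SELECT actor, action, payload FROM provenance "
        "ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [(actor, action, json.loads(p)) for actor, action, p in reversed(rows)]


# Hypothetical interaction history.
record("user", "request", {"text": "map my table to the target schema"})
record("agent", "match_schema", {"Gender": "gender"})
record("user", "correction", {"Stage": "tumor_stage"})

# Only the most recent steps would be placed in the prompt.
context_slice = recent(2)
print(context_slice)
```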

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

https://gdc.cancer.gov/developers/gdc-data-model

Dou et al. cohort (referenced in paper)

Risks & Boundaries

Limitations

LLM brittleness and inconsistent corrections across runs

Context-window problems for large tables and long workflows

When Not To Use

When you require fully automated, unattended harmonization for critical systems without human review

On extremely large schemas without external storage to manage context

Failure Modes

LLM hallucinations leading to incorrect mappings

Prompt sensitivity producing different pipelines for the same task
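One simple way to surface run-to-run inconsistency is to repeat the same task several times and flag columns whose mapping is not unanimous. The sketch below is a generic agreement check, not a method described in the paper; `agreement_report` and the sample runs are hypothetical.

```python
from collections import Counter


def agreement_report(runs):
    """For each source column, report the majority target and its vote share."""
    columns = set().union(*(run.keys() for run in runs))
    report = {}
    for col in sorted(columns):
        # A run that omits the column contributes a vote for None.
        votes = Counter(run.get(col) for run in runs)
        target, count = votes.most_common(1)[0]
        report[col] = {"majority": target, "agreement": count / len(runs)}
    return report


# Hypothetical column mappings produced by three runs of the same task.
runs = [
    {"Gender": "gender", "Stage": "tumor_stage"},
    {"Gender": "gender", "Stage": "tumor_grade"},
    {"Gender": "gender", "Stage": "tumor_stage"},
]

report = agreement_report(runs)

# Columns without unanimous agreement are candidates for human review.
needs_review = [col for col, r in report.items() if r["agreement"] < 1.0]
print(report, needs_review)
```

Columns listed in `needs_review` are the natural place to spend user oversight, which fits the human-in-the-loop use the summary recommends.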

Core Entities

Models

GPT-4o

Metrics

Accuracy, precision, recall, F1

Datasets

Dou et al. cohort (used in demo)

GDC (Genomic Data Commons) standard

Context Entities

Models

small LLMs for primitives (mentioned)

LLM ensembles (discussion)

Metrics

similarity scores exposed by primitives

Datasets

various cancer cohort tables cited

Benchmarks

task-level benchmarks cited (schema matching, entity linking)