Harmonia: an LLM-driven agent that interactively builds reproducible data harmonization pipelines

Overview

Decision Snapshot: Needs Validation

The prototype shows clear gains in a biomedical use case and provides reproducible plans, but robustness, scaling, and consistent automation require more work and user oversight.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/2

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.

Who Should Care

Data Scientist, ML Engineer, Product Manager, Engineering Lead

Summary TLDR

This paper proposes "agentic" data harmonization: LLM-based agents that interact with users and modular data-integration routines to synthesize reusable harmonization pipelines. The authors implement Harmonia, a prototype that combines a library of primitives (bdi-kit), an LLM agent (GPT-4o via Archytas), and a chat-style UI (Beaker). In a clinical use case mapping a cohort to the GDC standard, Harmonia outperforms baseline primitives (schema-matching accuracy 1.0 vs 0.88; value-mapping F1 0.68 vs 0.57). The paper discusses practical limits: LLM brittleness, context-window and evaluation gaps, and the need for provenance, uncertainty, and better benchmarks.

Problem Statement

Combining tables from different sources requires mapping column names and standardizing values. Existing work uses scripts and ad-hoc tools that are slow, brittle, and poorly documented. LLMs can help with language and code, but they are inconsistent, sensitive to prompts, and do not by themselves provide scalable, reproducible harmonization pipelines.
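
The two steps named above, mapping column names and standardizing values, can be sketched on a toy record. This is a minimal illustration and not Harmonia's or bdi-kit's API: the column names, the target vocabulary, and the `harmonize_row` helper are all hypothetical.

```python
# Hypothetical source-to-target column mapping (the output of schema matching).
COLUMN_MAP = {"Gender": "gender", "Tumor_Stage": "ajcc_pathologic_stage"}

# Hypothetical per-column value mappings (the output of value mapping).
VALUE_MAP = {
    "gender": {"F": "female", "M": "male"},
    "ajcc_pathologic_stage": {"Stage I": "Stage I", "stage 2": "Stage II"},
}


def harmonize_row(row):
    """Rename mapped columns and standardize their values; drop unmapped columns."""
    out = {}
    for source_col, value in row.items():
        target_col = COLUMN_MAP.get(source_col)
        if target_col is None:
            continue  # no target attribute was matched for this column
        # Fall back to the original value when no standard term is known.
        out[target_col] = VALUE_MAP.get(target_col, {}).get(value, value)
    return out


row = {"Gender": "F", "Tumor_Stage": "stage 2", "Notes": "n/a"}
harmonized = harmonize_row(row)
```

Keeping both mappings as plain data, rather than burying them in ad-hoc script logic, is what makes the transformation easy to inspect and re-run.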

Main Contribution

Present a vision for agentic data harmonization that combines LLM reasoning, user interaction, and composable primitives.

Introduce Harmonia, a working prototype that integrates bdi-kit primitives, Archytas-based LLM tool-calling, and a Beaker chat UI.

Key Findings

Harmonia produced perfect schema-matching on the evaluated use case.

Numbers: Schema accuracy Harmonia=1.00 vs Baseline=0.88

Practical Use: Use an agent to orchestrate algorithms plus LLM checks to improve schema matching accuracy on similar biomedical tasks.

Evidence Ref: Table 1 (Schema Matching)

LLM-augmented pipeline improved value-mapping F1 over the baseline primitives.

Numbers: Value mapping F1 Harmonia=0.68 vs Baseline=0.57

Practical Use: Leverage LLMs as evaluators and correctors on top of automated matchers to raise value-mapping quality.

Evidence Ref: Table 1 (Value Mapping)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	1.00	0.88	+0.12	Dou et al. cohort → GDC mapping	Table 1 reports Harmonia accuracy 1 and baseline 0.88	Table 1
Value mapping F1	0.68	0.57	+0.11	Dou et al. cohort → GDC mapping	Table 1 reports Harmonia F1 0.68 and baseline F1 0.57	Table 1
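
The Delta column is the plain difference between the Harmonia and baseline scores. The sketch below reproduces that arithmetic and also computes a relative gain, which is not reported in the source and is added here only as an illustration.

```python
# Scores copied from the results table.
results = {
    "schema_accuracy": {"harmonia": 1.00, "baseline": 0.88},
    "value_mapping_f1": {"harmonia": 0.68, "baseline": 0.57},
}


def delta(scores):
    """Absolute improvement over the baseline, rounded to two decimals."""
    return round(scores["harmonia"] - scores["baseline"], 2)


def relative_gain(scores):
    """Improvement as a percentage of the baseline score, rounded to one decimal."""
    diff = scores["harmonia"] - scores["baseline"]
    return round(100 * diff / scores["baseline"], 1)


deltas = {name: delta(s) for name, s in results.items()}
gains = {name: relative_gain(s) for name, s in results.items()}
```

With the table's numbers this gives deltas of 0.12 and 0.11, or relative gains of roughly 13.6% and 19.3% over the baseline. Both figures come from a single use case, so they should be read as indicative rather than general.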

What To Try In 7 Days

Clone the Harmonia repo and run the demo mapping a CSV to the GDC schema.

Replace one manual mapping script with a harmonization plan and materialize the mapping to save reproducible output.

Measure time and error rate vs your current manual harmonization process.
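
For the error-rate measurement, one way to score predicted mappings against a hand-checked gold standard is sketched here. The pair-based metric definitions, the function names, and the toy data are assumptions for illustration; the paper's exact definitions may differ.

```python
def accuracy(predicted, gold):
    """Fraction of gold source columns whose predicted target equals the gold target."""
    if not gold:
        return 0.0
    correct = sum(1 for col, target in gold.items() if predicted.get(col) == target)
    return correct / len(gold)


def precision_recall_f1(predicted_pairs, gold_pairs):
    """Set-based precision, recall, and F1 over (source value, target value) pairs."""
    predicted_pairs, gold_pairs = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy gold standard and predictions (hypothetical column and value names).
gold_columns = {"Gender": "gender", "Race": "race", "Stage": "ajcc_pathologic_stage"}
pred_columns = {"Gender": "gender", "Race": "ethnicity", "Stage": "ajcc_pathologic_stage"}
schema_acc = accuracy(pred_columns, gold_columns)

gold_values = {("F", "female"), ("M", "male"), ("U", "unknown")}
pred_values = {("F", "female"), ("M", "male"), ("U", "unspecified"), ("X", "unknown")}
p, r, f1 = precision_recall_f1(pred_values, gold_values)
```

Running the same scorer on the manual process and on the agent-produced plan gives a like-for-like error comparison; wall-clock time can be recorded separately.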

Agent Features

Memory

Provenance DB (history of actions and interactions)
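
A provenance store of this kind can be approximated with an append-only event log. The sketch below uses Python's built-in `sqlite3` module; the table layout and the `record` and `recent` helpers are hypothetical and are not taken from the Harmonia code base.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist across sessions
conn.execute(
    """CREATE TABLE provenance (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           actor TEXT NOT NULL,      -- 'user' or 'agent'
           action TEXT NOT NULL,     -- a tool name or 'message'
           payload TEXT NOT NULL     -- JSON-encoded arguments or results
       )"""
)


def record(actor, action, payload):
    """Append one event to the log and return its row id."""
    cur = conn.execute(
        "INSERT INTO provenance (actor, action, payload) VALUES (?, ?, ?)",
        (actor, action, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid


def recent(limit):
    """Return the most recent events, oldest first, with payloads decoded."""
    rows = conn.execute(
        "SELECT actor, action, payload FROM provenance ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [(actor, action, json.loads(payload)) for actor, action, payload in reversed(rows)]


record("user", "message", {"text": "Map my table to the target schema"})
record("agent", "match_columns", {"source": "Gender", "target": "gender"})
record("user", "message", {"text": "Looks right, continue"})
last_two = recent(2)
```

Fetching only the last few events with `recent` is one simple way to keep the prompt small while the full history stays queryable outside the model's context.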

Planning

tool calling

task decomposition

dynamic pipeline synthesis

Tool Use

bdi-kit primitives

on-demand Python code generation

Archytas tool-calling

Frameworks

ReAct

Archytas

Beaker

Is Agentic

Yes

Architectures

LLM-driven agent loop
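
The general shape of such a loop (choose a tool, run it, observe the result, repeat) can be sketched without any LLM dependency. In this sketch a hand-written `stub_policy` stands in for the model, and the tool names are hypothetical; it does not reproduce the Archytas or bdi-kit interfaces.

```python
def match_columns(state):
    """Hypothetical tool: propose a source-to-target column mapping."""
    state["column_map"] = {"Gender": "gender"}
    return "matched 1 column"


def map_values(state):
    """Hypothetical tool: propose value mappings for the matched columns."""
    state["value_map"] = {"gender": {"F": "female", "M": "male"}}
    return "mapped 2 values"


TOOLS = {"match_columns": match_columns, "map_values": map_values}


def stub_policy(state, history):
    """Stand-in for the LLM: return the next tool name, or None when done."""
    if "column_map" not in state:
        return "match_columns"
    if "value_map" not in state:
        return "map_values"
    return None


def run_agent(policy, max_steps=10):
    """Alternate between choosing a tool and observing its result until the policy stops."""
    state, history = {}, []
    for _ in range(max_steps):  # hard cap guards against a policy that never stops
        tool_name = policy(state, history)
        if tool_name is None:
            break
        observation = TOOLS[tool_name](state)
        history.append((tool_name, observation))
    return state, history


final_state, trace = run_agent(stub_policy)
```

In a real system the policy would be an LLM call that sees the history and the user's messages; the step cap and the explicit tool registry are design choices that keep a misbehaving model from running unbounded or calling arbitrary code.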

Collaboration

two-way user-agent interaction

Optimization Features

Token Efficiency

objective to minimize LLM calls (discussed)

Infra Optimization

use of external Provenance DB to avoid context window limits

System Optimization

balance objectives: correctness, user interactions, compute cost

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/VIDA-NYU/harmonia/

https://github.com/VIDANYU/bdi-kit

Data URLs

https://gdc.cancer.gov/developers/gdc-data-model

Dou et al. cohort (referenced in paper)

Risks & Boundaries

Limitations

LLM brittleness and inconsistent corrections across runs

Context-window problems for large tables and long workflows

When Not To Use

When you require fully automated, unattended harmonization for critical systems without human review

On extremely large schemas without external storage to manage context

Failure Modes

LLM hallucinations leading to incorrect mappings

Prompt sensitivity producing different pipelines for the same task
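
One common mitigation for run-to-run variation, offered here as an assumption and not as something the paper implements, is to repeat the task several times and accept only mappings that most runs agree on. The `consensus` helper, the agreement threshold, and the sample runs are hypothetical.

```python
from collections import Counter


def consensus(runs, min_agreement=0.6):
    """Majority-vote each source column across runs; flag low-agreement columns for review."""
    accepted, needs_review = {}, []
    columns = sorted({col for run in runs for col in run})
    for col in columns:
        votes = Counter(run[col] for run in runs if col in run)
        target, count = votes.most_common(1)[0]
        # Agreement is measured against all runs, so a missing answer counts against it.
        if count / len(runs) >= min_agreement:
            accepted[col] = target
        else:
            needs_review.append(col)
    return accepted, needs_review


# Three hypothetical runs of the same task that disagree on one column.
runs = [
    {"Gender": "gender", "Stage": "ajcc_pathologic_stage"},
    {"Gender": "gender", "Stage": "tumor_grade"},
    {"Gender": "gender", "Stage": "ajcc_clinical_stage"},
]
accepted, needs_review = consensus(runs)
```

Columns that land in `needs_review` are natural candidates for the human oversight the summary calls for, at the cost of extra LLM calls per task.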

Core Entities

Models

GPT-4o

Metrics

Accuracy

Precision

Recall

F1

Datasets

Dou et al. cohort (used in demo)

GDC (Genomic Data Commons) standard

Context Entities

Models

small LLMs for primitives (mentioned)

LLM ensembles (discussion)

Metrics

similarity scores exposed by primitives

Datasets

various cancer cohort tables cited

Benchmarks

task-level benchmarks cited (schema matching, entity linking)
