Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.
Summary TLDR
This paper proposes "agentic" data harmonization: LLM-based agents that interact with users and with modular data-integration routines to synthesize reusable harmonization pipelines. The authors implement Harmonia, a prototype that combines a library of primitives (bdi-kit), an LLM agent (GPT-4o via Archytas), and a chat-style UI (Beaker). In a clinical use case mapping a cohort to the GDC standard, Harmonia outperforms the baseline primitives (schema-matching accuracy 1.0 vs 0.88; value-mapping F1 0.68 vs 0.57). The paper also discusses practical limits: LLM brittleness, context-window limits, gaps in evaluation, and the need for provenance, uncertainty handling, and better benchmarks.
Problem Statement
Combining tables from different sources requires mapping column names (schema matching) and standardizing values (value mapping). Current practice relies on scripts and ad-hoc tools that are slow, brittle, and poorly documented. LLMs can help with language and code, but they are inconsistent, sensitive to prompts, and do not by themselves provide scalable, reproducible harmonization pipelines.
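To make the two sub-tasks concrete, here is a minimal sketch of schema matching and value mapping using plain string similarity from the Python standard library. It is not the bdi-kit API or the paper's method; the column names, target attributes, and the 0.6 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name: str, candidates: list, threshold: float = 0.6) -> Optional[str]:
    """Return the most similar candidate, or None if nothing clears the threshold."""
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None


# Hypothetical source columns and target attributes (illustrative only).
source_columns = ["Gender", "Tumor_Stage", "Patient_Age"]
target_attributes = ["gender", "tumor_stage", "age_at_diagnosis"]

# Schema matching: map each source column to a target attribute.
column_mapping = {col: best_match(col, target_attributes) for col in source_columns}

# Value mapping: standardize the values of one matched column.
source_values = ["Male", "FEMALE", "unknown "]
target_values = ["male", "female", "unknown"]
value_mapping = {v: best_match(v.strip(), target_values) for v in source_values}

print(column_mapping)
print(value_mapping)
```

In this toy run "Patient_Age" finds no match above the threshold, which is the kind of case where name similarity alone falls short and a user or an LLM has to step in.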
Main Contribution
Present a vision for agentic data harmonization that combines LLM reasoning, user interaction, and composable primitives.
Introduce Harmonia, a working prototype that integrates bdi-kit primitives, Archytas-based LLM tool-calling, and a Beaker chat UI.
Demonstrate a clinical use case mapping a real cohort to the GDC schema and produce a reusable declarative mapping.
Discuss open problems: evaluation benchmarks, uncertainty/explanations, robustness, provenance, and UI design.
Key Findings
Harmonia reached a schema-matching accuracy of 1.0 on the evaluated use case, against 0.88 for the baseline primitives.
The LLM-augmented pipeline raised value-mapping F1 from 0.57 (baseline primitives) to 0.68.
LLMs are helpful but inconsistent: with identical prompts, they correct an error in some runs and fail to in others.
Results
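Schema-matching accuracy: 1.0 (Harmonia) vs 0.88 (baseline primitives)
Value-mapping F1: 0.68 (Harmonia) vs 0.57 (baseline primitives)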
Who Should Care
What To Try In 7 Days
Clone the Harmonia repo and run the demo mapping a CSV to the GDC schema.
Replace one manual mapping script with a harmonization plan and materialize the mapping to save reproducible output.
Measure time and error rate vs your current manual harmonization process.
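The second step above can be pictured as follows: a declarative mapping is plain data that can be saved, reviewed, and re-applied. This is a minimal sketch only. The spec layout, the `materialize` helper, and the column and value names are hypothetical and are not the format used by Harmonia or bdi-kit, nor confirmed GDC terms.

```python
import json

# Hypothetical declarative mapping: one rule per source column.
mapping_spec = [
    {"source": "Gender", "target": "gender",
     "values": {"Male": "male", "Female": "female"}},
    {"source": "Stage", "target": "tumor_stage",
     "values": {"I": "stage i", "II": "stage ii"}},
]


def materialize(rows: list, spec: list) -> list:
    """Apply the mapping spec to each row; unmapped values become None for review."""
    output = []
    for row in rows:
        new_row = {}
        for rule in spec:
            raw = row.get(rule["source"])
            new_row[rule["target"]] = rule["values"].get(raw)
        output.append(new_row)
    return output


rows = [
    {"Gender": "Male", "Stage": "II"},
    {"Gender": "F", "Stage": "I"},  # "F" is not covered by the spec
]
harmonized = materialize(rows, mapping_spec)

# The spec is JSON-serializable, so it can be stored next to the output.
saved_spec = json.dumps(mapping_spec, indent=2)
print(harmonized)
```

Keeping the mapping as data rather than as ad-hoc script logic is what makes the result reproducible: the same spec applied to the same input yields the same output.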
Agent Features
Memory
- Provenance DB (history of actions and interactions)
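As an illustration of what a provenance store might hold, here is a minimal sketch using an in-memory SQLite database. The table layout and the `record` and `history` helpers are assumptions made for this example; the summary does not describe the schema Harmonia uses.

```python
import json
import sqlite3


def open_provenance(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a single append-only provenance table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provenance ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "actor TEXT NOT NULL, action TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, actor: str, action: str, payload: dict) -> None:
    """Append one event; the payload is stored as JSON text."""
    conn.execute(
        "INSERT INTO provenance (actor, action, payload) VALUES (?, ?, ?)",
        (actor, action, json.dumps(payload, sort_keys=True)),
    )
    conn.commit()


def history(conn: sqlite3.Connection, action: str = None) -> list:
    """Return events in insertion order, optionally filtered by action."""
    query = "SELECT actor, action, payload FROM provenance"
    params = ()
    if action is not None:
        query += " WHERE action = ?"
        params = (action,)
    query += " ORDER BY id"
    return [(a, act, json.loads(p)) for a, act, p in conn.execute(query, params)]


conn = open_provenance()
record(conn, "agent", "match_schema", {"Gender": "gender"})
record(conn, "user", "correct_mapping", {"Stage": "tumor_stage"})
print(history(conn))
```

Because events live outside the prompt, the agent can look up only the entries it needs instead of carrying the whole history in context.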
Planning
- tool calling
- task decomposition
- dynamic pipeline synthesis
Tool Use
- bdi-kit primitives
- on-demand Python code generation
- Archytas tool-calling
Frameworks
- ReAct
- Archytas
- Beaker
Is Agentic
true
Architectures
- LLM-driven agent loop
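An agent loop of this kind can be sketched as a registry of tools plus a policy that picks the next tool. In the sketch below the policy is a hand-written stub standing in for the LLM, and the tools are toy functions; none of the names correspond to Archytas, Beaker, or bdi-kit APIs.

```python
from typing import Callable, Optional


def match_schema_tool(state: dict) -> dict:
    """Toy primitive: match columns by case-insensitive name equality."""
    targets = {t.lower(): t for t in state["target_attributes"]}
    state["column_mapping"] = {
        col: targets.get(col.lower()) for col in state["source_columns"]
    }
    return state


def ask_user_tool(state: dict) -> dict:
    """Toy primitive: fill unmatched columns from user-supplied answers."""
    for col, target in list(state["column_mapping"].items()):
        if target is None:
            state["column_mapping"][col] = state["user_answers"].get(col)
    state["asked_user"] = True
    return state


TOOLS = {"match_schema": match_schema_tool, "ask_user": ask_user_tool}


def stub_policy(state: dict) -> Optional[str]:
    """Stand-in for the LLM: return the next tool name, or None to stop."""
    if "column_mapping" not in state:
        return "match_schema"
    unresolved = [c for c, t in state["column_mapping"].items() if t is None]
    if unresolved and not state.get("asked_user"):
        return "ask_user"
    return None


def run_agent(state: dict, policy: Callable, max_steps: int = 10):
    """Call tools chosen by the policy until it stops or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        tool_name = policy(state)
        if tool_name is None:
            break
        state = TOOLS[tool_name](state)
        trace.append(tool_name)
    return state, trace


initial = {
    "source_columns": ["Gender", "Pt_Age"],
    "target_attributes": ["gender", "age_at_diagnosis"],
    "user_answers": {"Pt_Age": "age_at_diagnosis"},
}
final_state, trace = run_agent(initial, stub_policy)
print(trace, final_state["column_mapping"])
```

The `max_steps` budget is one simple way to bound the number of policy calls, which in a real system would be LLM calls.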
Collaboration
- two-way user-agent interaction
Optimization Features
Token Efficiency
- objective to minimize LLM calls (discussed)
Infra Optimization
- use of external Provenance DB to avoid context window limits
System Optimization
- balance objectives: correctness, user interactions, compute cost
Reproducibility
Data Urls
- https://gdc.cancer.gov/developers/gdc-data-model
- Dou et al. cohort (referenced in paper)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- LLM brittleness and inconsistent corrections across runs
- Context-window problems for large tables and long workflows
- Lack of end-to-end benchmarks to evaluate agentic harmonization
- Current UI is chat-based only; richer visual tools are needed
When Not To Use
- When you require fully automated, unattended harmonization for critical systems without human review
- On extremely large schemas without external storage to manage context
- When certification or audit trails are legally required and provenance tracking has not been integrated
Failure Modes
- LLM hallucinations leading to incorrect mappings
- Prompt sensitivity producing different pipelines for the same task
- Loss of earlier context causing wrong decisions in long workflows
- Over-reliance on LLM evaluation that reflects model bias
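One way to surface prompt sensitivity is to repeat the same task several times and compare the resulting mappings. The sketch below assumes each run yields a dict from source column to target attribute; the helper names and the example data are hypothetical and do not come from the paper.

```python
from collections import Counter


def agreement_report(runs: list) -> dict:
    """For each column, report the majority target and the share of runs that chose it."""
    report = {}
    columns = sorted({col for run in runs for col in run})
    for col in columns:
        votes = Counter(run.get(col) for run in runs)
        top, count = votes.most_common(1)[0]
        report[col] = {"majority": top, "agreement": count / len(runs)}
    return report


def needs_review(report: dict, threshold: float = 1.0) -> list:
    """Columns whose agreement falls below the threshold should go to a human."""
    return [col for col, info in report.items() if info["agreement"] < threshold]


# Three hypothetical runs of the same harmonization task.
runs = [
    {"Gender": "gender", "Stage": "tumor_stage"},
    {"Gender": "gender", "Stage": "tumor_stage"},
    {"Gender": "gender", "Stage": "ajcc_stage"},
]
report = agreement_report(runs)
print(needs_review(report))
```

Agreement across runs does not prove a mapping is correct, but disagreement is a cheap signal for where human review is most needed.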
Core Entities
Models
- GPT-4o
Metrics
- accuracy
- precision
- recall
- F1
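For reference, these metrics can be computed as in the sketch below. It uses the standard definitions; how the paper counts matches (for example, per column or per value pair) is not stated in this summary, so the units chosen here are an assumption, and the example data is made up.

```python
def accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth columns mapped to the correct target."""
    correct = sum(1 for col, target in truth.items() if predicted.get(col) == target)
    return correct / len(truth)


def precision_recall_f1(predicted: set, truth: set) -> tuple:
    """Precision, recall, and F1 over sets of (source_value, target_value) pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical schema-matching result: 2 of 2 columns correct.
acc = accuracy(
    {"Gender": "gender", "Stage": "tumor_stage"},
    {"Gender": "gender", "Stage": "tumor_stage"},
)

# Hypothetical value-mapping result: 3 predicted pairs, 2 correct, 4 expected.
truth_pairs = {("M", "male"), ("F", "female"), ("U", "unknown"), ("N/A", "not reported")}
predicted_pairs = {("M", "male"), ("F", "female"), ("U", "not reported")}
precision, recall, f1 = precision_recall_f1(predicted_pairs, truth_pairs)
print(acc, precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it penalizes a pipeline that maps few values very accurately as well as one that maps many values loosely.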
Datasets
- Dou et al. cohort (used in demo)
- GDC (Genomic Data Commons) standard
Context Entities
Models
- small LLMs for primitives (mentioned)
- LLM ensembles (discussion)
Metrics
- similarity scores exposed by primitives
Datasets
- various cancer cohort tables cited
Benchmarks
- task-level benchmarks cited (schema matching, entity linking)

