Design the lakehouse for agents first: solve concurrent runs with branching + isolated functions, and governance follows.

Overview

Decision Snapshot: Needs Validation

The ideas map known database primitives (MVCC, branching) onto lakehouse design and come with a reference implementation. However, this is a position and system-design paper with limited empirical validation, so practical payoff depends on integration effort and platform maturity.

Citations: 0

Evidence Strength: 0.40

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Jacopo Tagliabue, Federico Bianchi, Ciro Greco

Links

Abstract / PDF / Code

Why It Matters For Business

If you let agents mutate your lakehouse without transactional guarantees and runtime isolation, they can corrupt production data or leak secrets. Building a small, enforceable run API and sandboxed functions reduces risk and makes governance feasible.

Who Should Care

CTO, Engineering Lead, ML Engineer, Data Scientist, Product Manager, Founder

Summary TLDR

Enterprises distrust autonomous agents because existing lakehouse infrastructure does not provide strong multi-step transactional guarantees or runtime isolation across heterogeneous tools. The paper proposes Bauplan: an agent-first lakehouse that (1) records immutable table snapshots and supports copy-on-write branching and atomic merges across multi-table pipelines, (2) runs each pipeline function in an isolated, network-blocked FaaS container, and (3) exposes a single declarative run API so agent runs publish to temporary branches and only merge on full success. This design restores transactional correctness for multi-node pipelines and makes governance practical through a small, checkable API surface.
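To make the "small, checkable API surface" concrete, here is a minimal sketch of validating a run request before anything executes. The `RunRequest` fields, the `ALLOWED_PACKAGES` set, and the `validate` function are all hypothetical names chosen for illustration; they are not Bauplan's actual schema or API.

```python
from dataclasses import dataclass

# Hypothetical allowlist; a real platform would manage this centrally.
ALLOWED_PACKAGES = {"pandas", "pyarrow", "duckdb"}


@dataclass(frozen=True)
class RunRequest:
    """Illustrative shape of a declarative run request (an assumption)."""
    pipeline: str
    target_branch: str
    packages: tuple = ()
    network_access: bool = False


def validate(request: RunRequest) -> list:
    """Return policy violations; an empty list means the run may start."""
    problems = []
    if request.target_branch == "main":
        problems.append("runs must target a temporary branch, not main")
    if request.network_access:
        problems.append("functions run without network access")
    for pkg in request.packages:
        if pkg not in ALLOWED_PACKAGES:
            problems.append(f"package not allowlisted: {pkg}")
    return problems
```

Because every agent run passes through one request shape, the governance rules reduce to a short list of checks that can be audited in one place.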

Problem Statement

Traditional lakehouses decouple storage and compute and support many runtimes. That decoupling breaks transactional guarantees across multi-step pipelines and inflates the attack surface for agents. Without new primitives, agents can leave the lake in inconsistent states or run untrusted code that harms production data.

Main Contribution

Diagnosis: explaining why MVCC (database transactions) cannot be transplanted naively to a decoupled, multi-runtime lakehouse.

Design: Bauplan — an agent-first lakehouse with copy-on-write branching, temporary runs, and atomic merges that span multiple tables and pipeline nodes.

Key Findings

Multi-node pipelines need atomic commits across tables, not per-table transactions.

Practical Use: Treat a pipeline run as a multi-table transaction: use temporary branches for a run and merge atomically on success to avoid inconsistent main state.

Evidence Ref: Section 3.1 and Figure 2: single-table guarantees let downstream readers see a mixed t…
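The merge-on-success pattern can be sketched with a toy in-memory catalog. This is an illustration under stated assumptions, not the paper's implementation: `Catalog`, `run_pipeline`, and the string snapshot ids are hypothetical, and the merge simply replaces `main` without checking for concurrent changes.

```python
class Catalog:
    """Toy catalog: each branch maps table name -> immutable snapshot id."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Copy only the pointers; snapshots themselves are shared.
        self.branches[name] = dict(self.branches[source])

    def merge(self, name, into="main"):
        # One reference swap publishes every table of the run at once.
        self.branches[into] = dict(self.branches[name])
        del self.branches[name]


def run_pipeline(catalog, run_id, steps):
    """Run steps on a temporary branch; publish only if all of them succeed."""
    branch = f"run/{run_id}"
    catalog.create_branch(branch)
    try:
        for table, step in steps:
            # Each step sees the branch state and returns a new snapshot id.
            catalog.branches[branch][table] = step(catalog.branches[branch])
    except Exception:
        # Any failure discards the branch, so main never sees partial output.
        del catalog.branches[branch]
        return False
    catalog.merge(branch)
    return True
```

If the second of two steps fails, the first step's output disappears with the branch, so readers of `main` never observe a half-finished run.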

Branching with copy-on-write is efficient enough to handle large workloads.

Numbers: designed for 'hundreds of tables and billions of rows'

Practical Use: Use snapshot-based branching (git-like) rather than locking per-table; this scales to large tables without full data copies.

Evidence Ref: Section 4.1: 'copy-on-write branching... even on hundreds of tables and billions…
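Why branching stays cheap can be shown with a small sketch: a branch copies only table-to-snapshot pointers, and a write builds a new snapshot for the one table it touches. The file names and helper functions below are hypothetical; real table formats track snapshots in metadata files rather than Python tuples.

```python
def create_branch(source):
    """Copy only the table -> snapshot pointers; no data files are duplicated."""
    return dict(source)


def append_files(branch, table, new_files):
    """Write by building a new snapshot; the old snapshot stays untouched."""
    branch[table] = branch[table] + tuple(new_files)


# A snapshot is modeled as an immutable tuple of data file names.
main = {
    "orders": ("orders-0001.parquet", "orders-0002.parquet"),
    "users": ("users-0001.parquet",),
}

dev = create_branch(main)
append_files(dev, "orders", ["orders-0003.parquet"])

# Untouched tables are shared between branches, not copied.
print(dev["users"] is main["users"])
# main still sees its original two files for "orders".
print(len(main["orders"]), len(dev["orders"]))
```

The cost of creating a branch depends on the number of tables, not on the number of rows, which is the property the finding relies on.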

What To Try In 7 Days

Run a prototype pipeline: put agent outputs into a temporary branch and practice merge-on-success.

Sandbox a single pipeline node in a network-blocked container and test package whitelisting.

Map current pipelines to a declarative I/O interface (functions that accept tables and return tables) to see gaps.
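For the third item, one way to picture a declarative I/O interface is a set of plain functions whose parameter names declare their input tables, so the execution order can be derived rather than scripted. This is a sketch under that assumption, using only the Python standard library; the function and table names are hypothetical and tables are modeled as lists of dicts.

```python
import inspect
from graphlib import TopologicalSorter


def clean_orders(raw_orders):
    """Input: raw_orders. Output: table named after this function."""
    return [row for row in raw_orders if row["amount"] > 0]


def order_totals(clean_orders):
    """Input: clean_orders. Output: one row per user with the summed amount."""
    totals = {}
    for row in clean_orders:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return [{"user": u, "total": t} for u, t in sorted(totals.items())]


def run(functions, sources):
    """Execute functions in dependency order; parameter names declare inputs."""
    by_name = {fn.__name__: fn for fn in functions}
    graph = {
        name: list(inspect.signature(fn).parameters)
        for name, fn in by_name.items()
    }
    tables = dict(sources)
    for name in TopologicalSorter(graph).static_order():
        if name in by_name:  # source tables have no function to run
            fn = by_name[name]
            args = [tables[p] for p in inspect.signature(fn).parameters]
            tables[name] = fn(*args)
    return tables
```

A pipeline that cannot be written this way (for example, a step that reads a table it never declares) is exactly the kind of gap the exercise is meant to surface.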

Agent Features

Memory

branch snapshots (immutable table history)

Planning

ReAct loop, pipeline orchestration via unified run API

Tool Use

containerized runtimes, package whitelisting, no-internet sandboxing

Frameworks

Bauplan

Is Agentic

Yes

Architectures

FaaS, branch-and-merge (git-like), declarative I/O

Collaboration

human-in-the-loop verification, branch-review-merge workflow

Optimization Features

Infra Optimization

centralized run API enabling platform-side optimizations

System Optimization

copy-on-write branching to avoid full data copies

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BauplanLabs/the-agentic-lakehouse

Risks & Boundaries

Limitations

Position paper without quantitative benchmarks or experiments.

Relies on a central platform controlling run API and FaaS; not directly applicable to fragmented/legacy infra.

When Not To Use

You cannot change the platform or impose a single run API.

Workloads require direct internet access from runtime.

Failure Modes

Merge conflicts across branches that require manual resolution and delay deployment.

Verifier false negatives — automated checks miss edge-case failures.
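The merge-conflict failure mode can be made concrete with a three-way comparison of snapshot pointers: a table conflicts when both `main` and the branch moved it away from the branch point, to different snapshots. This is a generic sketch, not the paper's merge algorithm; `find_conflicts` and the snapshot ids are hypothetical.

```python
def find_conflicts(base, main, branch):
    """Tables changed on both sides since the branch point, to different snapshots."""
    conflicts = []
    for table in sorted(set(base) | set(main) | set(branch)):
        b, m, r = base.get(table), main.get(table), branch.get(table)
        # Changed on main, changed on the branch, and the two results differ.
        if m != b and r != b and m != r:
            conflicts.append(table)
    return conflicts


base = {"orders": "s1", "users": "u1"}    # state when the branch was created
main = {"orders": "s2", "users": "u1"}    # another run already merged "orders"
branch = {"orders": "s3", "users": "u2"}  # this run rewrote both tables

print(find_conflicts(base, main, branch))
```

Here only `orders` needs manual resolution; `users` changed on the branch alone, so it can merge cleanly.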

