Design the lakehouse for agents first: solve concurrent runs with branching + isolated functions, and governance follows.

November 20, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Jacopo Tagliabue, Federico Bianchi, Ciro Greco

Why It Matters For Business

If you let agents mutate your lakehouse without transactional guarantees and runtime isolation, they can corrupt production data or leak secrets. Building a small, enforceable run API and sandboxed functions reduces that risk and makes governance feasible.

Summary TLDR

Enterprises distrust autonomous agents because existing lakehouse infrastructure doesn't provide strong multi-step transactions and runtime isolation across heterogeneous tools. The paper proposes Bauplan: an agent-first lakehouse that (1) records immutable table snapshots and supports copy-on-write branching and atomic merges across multi-table pipelines, (2) runs each pipeline function in an isolated, network-blocked FaaS container, and (3) exposes a single declarative run API so agent runs publish to temporary branches and only merge on full success. This design restores transactional correctness for multi-node pipelines and makes governance practical through a small, checkable API surface.

Problem Statement

Traditional lakehouses decouple storage and compute and support many runtimes. That decoupling breaks transactional guarantees across multi-step pipelines and inflates the attack surface for agents. Without new primitives, agents can leave the lake in inconsistent states or run untrusted code that harms production data.

Main Contribution

Diagnosis: an explanation of why MVCC (multi-version concurrency control, the mechanism behind database transactions) cannot be transplanted naively to a decoupled, multi-runtime lakehouse.

Design: Bauplan — an agent-first lakehouse with copy-on-write branching, temporary runs, and atomic merges that span multiple tables and pipeline nodes.

Compute model: use FaaS-based, containerized, network-isolated functions per pipeline node to enforce runtime isolation and limit attack vectors.

Programming abstraction: declarative I/O (functions accept/output tables) plus a single run API that ties data branches and compute runs together.

Worked example: a self-healing pipeline pattern in which an agent produces both code and a verifier; the platform runs the verifier, and a human then reviews the result before merge.
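
The gate in the worked example can be sketched as a small decision function. This is a minimal illustration in plain Python, not the paper's implementation: `Proposal`, `review`, the branch names, and the `no_nulls` check are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    branch: str                       # temporary branch holding the agent's output (hypothetical)
    rows: list                        # data produced by the agent-written code
    verifier: Callable[[list], bool]  # check supplied alongside the code


def review(proposal, human_approves):
    """Return the decision for a proposed fix: verifier first, then a human."""
    try:
        passed = proposal.verifier(proposal.rows)
    except Exception:
        passed = False  # a crashing verifier counts as a failed check
    if not passed:
        return "rejected: verifier failed"
    if not human_approves(proposal):
        return "rejected: human review"
    return f"merge {proposal.branch} into main"


def no_nulls(rows):
    # Example verifier: every row must carry a price.
    return all(r.get("price") is not None for r in rows)


good = Proposal("agent-fix-1", [{"price": 9.5}], no_nulls)
bad = Proposal("agent-fix-2", [{"price": None}], no_nulls)

print(review(good, lambda p: True))
print(review(bad, lambda p: True))
print(review(good, lambda p: False))
```

The ordering is the point of the sketch: the automated check filters out obviously broken proposals, so the human only sees branches that already passed it.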

Key Findings

Multi-node pipelines need atomic commits across tables, not per-table transactions.
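
A toy comparison makes the difference concrete. The sketch below uses in-memory dictionaries rather than any real lakehouse API, and every name in it (`Lake`, `run_per_table`, `run_atomic`, the table names) is an assumption made for illustration.

```python
class Lake:
    def __init__(self, tables):
        self.main = tables  # table name -> list of rows


def build_orders(tables):
    return [row for row in tables["raw"] if row["amount"] > 0]


def build_report(tables):
    raise RuntimeError("simulated failure in the second node")


PIPELINE = [("orders", build_orders), ("report", build_report)]


def run_per_table(lake):
    """Commit each table as soon as it is produced."""
    for name, node in PIPELINE:
        lake.main[name] = node(lake.main)  # visible to readers immediately


def run_atomic(lake):
    """Stage all outputs privately, then publish them in one step."""
    staged = dict(lake.main)  # copies table pointers, not rows
    for name, node in PIPELINE:
        staged[name] = node(staged)
    lake.main = staged  # reached only if every node succeeded


def demo(runner):
    lake = Lake({"raw": [{"amount": 5}, {"amount": -1}], "orders": [], "report": []})
    try:
        runner(lake)
    except RuntimeError:
        pass
    return lake.main


per_table = demo(run_per_table)
atomic = demo(run_atomic)
```

After the simulated failure, `per_table` holds a refreshed `orders` table next to a stale `report`, while `atomic` leaves both tables exactly as they were before the run.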

The paper argues that copy-on-write branching is efficient enough for large workloads, though it reports no benchmarks.

Numbers: designed for 'hundreds of tables and billions of rows'

Compute isolation is achieved by running each pipeline function in a network-isolated container (FaaS).

A single unified run API (bauplan.run) ties branching, data fetch, function execution and atomic merge into one flow.
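
The flow behind such a call can be sketched as branch, execute, merge. The code below is a stand-in written for this summary and does not reproduce the actual `bauplan.run` signature; `Catalog`, its methods, and the `run` helper are hypothetical, and conflict handling is deliberately left out.

```python
import uuid


class Catalog:
    """Toy catalog: each branch maps a table name to an immutable snapshot (a tuple)."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, from_ref):
        self.branches[name] = dict(self.branches[from_ref])  # copy pointers only

    def merge(self, source, into):
        self.branches[into] = dict(self.branches[source])  # single pointer swap

    def delete_branch(self, name):
        self.branches.pop(name, None)


def run(catalog, nodes, ref="main"):
    """Hypothetical unified run: temporary branch, execute nodes, merge on success."""
    tmp = f"run-{uuid.uuid4().hex[:8]}"
    catalog.create_branch(tmp, from_ref=ref)
    try:
        for output, fn in nodes:
            tables = catalog.branches[tmp]
            tables[output] = tuple(fn(tables))
        catalog.merge(tmp, into=ref)
        return True
    except Exception:
        return False  # nothing was published to `ref`
    finally:
        catalog.delete_branch(tmp)  # the temporary branch never outlives the run


catalog = Catalog()
catalog.branches["main"]["raw"] = (1, 2, 3)

ok = run(catalog, [("doubled", lambda t: [x * 2 for x in t["raw"]])])
bad = run(catalog, [
    ("tripled", lambda t: [x * 3 for x in t["raw"]]),
    ("broken", lambda t: 1 / 0),
])
```

The second run fails at its last node, so `tripled` never reaches `main` even though it was computed successfully on the temporary branch.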

Agent work can be made auditable and human-reviewed via data-branch outputs and verifiers.

Who Should Care

What To Try In 7 Days

Run a prototype pipeline: put agent outputs into a temporary branch and practice merge-on-success.

Sandbox a single pipeline node in a network-blocked container and test package whitelisting.

Map current pipelines to a declarative I/O interface (functions that accept tables and return tables) to see gaps.
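
For the mapping exercise, one possible shape of a declarative I/O interface is sketched below. The `model` decorator, `REGISTRY`, and `execute` are hypothetical helpers invented for this example; only `graphlib.TopologicalSorter` comes from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

REGISTRY = {}


def model(inputs):
    """Hypothetical decorator: the function name is the output table,
    and `inputs` lists the tables the function reads."""
    def register(fn):
        REGISTRY[fn.__name__] = (tuple(inputs), fn)
        return fn
    return register


@model(inputs=["raw_orders"])
def clean_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] > 0]


@model(inputs=["clean_orders"])
def revenue(clean_orders):
    return [{"total": sum(o["amount"] for o in clean_orders)}]


def execute(source_tables):
    """Run registered functions in dependency order, derived from declarations alone."""
    tables = dict(source_tables)
    graph = {name: set(deps) for name, (deps, _) in REGISTRY.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in REGISTRY:  # source tables have no producing function
            deps, fn = REGISTRY[name]
            tables[name] = fn(*(tables[d] for d in deps))
    return tables


result = execute({"raw_orders": [{"amount": 10}, {"amount": -3}, {"amount": 5}]})
print(result["revenue"])
```

A pipeline step that cannot be written this way (for example, one that reads from an external service mid-function) is exactly the kind of gap the exercise is meant to surface.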

Agent Features

Memory

  • branch snapshots (immutable table history)

Planning

  • ReAct loop
  • pipeline orchestration via unified run API

Tool Use

  • containerized runtimes
  • package whitelisting
  • no-internet sandboxing
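
As a rough sketch of how these three controls fit together, the snippet below checks requested packages against an allowlist and builds (without running) a container command with networking disabled. The allowlist contents, function names, and image name are assumptions; `--network none` is a standard Docker CLI option, and the name parsing is a simplification that ignores extras and environment markers.

```python
import re

ALLOWED_PACKAGES = {"pandas", "pyarrow", "duckdb"}  # hypothetical allowlist


def package_name(requirement):
    """Extract the distribution name from a simple requirement such as 'pandas==2.2.0'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if not match:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(1).lower().replace("_", "-")


def rejected(requirements):
    """Return the requirements whose package is not on the allowlist."""
    return [r for r in requirements if package_name(r) not in ALLOWED_PACKAGES]


def sandbox_command(image, script):
    """Build a Docker invocation with no network access; the caller decides whether to run it."""
    return ["docker", "run", "--rm", "--network", "none", image, "python", script]


blocked = rejected(["pandas==2.2.0", "requests>=2.31", "PyArrow"])
command = sandbox_command("pipeline-node:latest", "node.py")
print(blocked)
print(" ".join(command))
```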

Frameworks

  • Bauplan

Is Agentic

true

Architectures

  • FaaS
  • branch-and-merge (git-like)
  • declarative I/O

Collaboration

  • human-in-the-loop verification
  • branch-review-merge workflow

Optimization Features

Infra Optimization

  • centralized run API enabling platform-side optimizations

System Optimization

  • copy-on-write branching to avoid full data copies
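
The idea can be shown with a toy model in which a snapshot is an immutable tuple of data-file names and a branch is a mapping from table names to snapshots. The file names and the `append` helper are hypothetical, and real table formats track considerably more metadata than this.

```python
def append(snapshot, new_file):
    """Return a new snapshot that reuses the existing files and adds one more."""
    return snapshot + (new_file,)


main = {"orders": ("orders-0001.parquet", "orders-0002.parquet")}

branch = dict(main)  # branching copies table pointers, not data files
branch["orders"] = append(branch["orders"], "orders-0003.parquet")

shared = set(main["orders"]) & set(branch["orders"])
print(len(main["orders"]), len(branch["orders"]), len(shared))
```

Writing on the branch adds one new file and one new snapshot; the two original files are referenced by both branches, and `main` is unchanged.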

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Position paper without quantitative benchmarks or experiments.
  • Relies on a central platform controlling run API and FaaS; not directly applicable to fragmented/legacy infra.
  • Package whitelisting and network isolation reduce risk but may limit some valid workloads.

When Not To Use

  • You cannot change the platform or impose a single run API.
  • Workloads require direct internet access from runtime.
  • You must support many legacy jobs that cannot be containerized easily.

Failure Modes

  • Merge conflicts across branches that require manual resolution and delay deployment.
  • Verifier false negatives — automated checks miss edge-case failures.
  • Malicious or buggy packages slip through the whitelist or arrive via supply-chain vectors.
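
To make the first failure mode concrete, a merge conflict can be detected by comparing snapshot identifiers at the branch point, on the target, and on the branch. This is a generic three-way comparison written for illustration, not the platform's merge logic; the `conflicts` function and the snapshot identifiers are hypothetical.

```python
def conflicts(base, main, branch):
    """Tables changed on both sides since the branch point, to different snapshots."""
    names = set(base) | set(main) | set(branch)
    return sorted(
        n for n in names
        if main.get(n) != base.get(n)        # main moved
        and branch.get(n) != base.get(n)     # branch moved
        and main.get(n) != branch.get(n)     # and they disagree
    )


base = {"orders": "s1", "users": "s1"}
main = {"orders": "s2", "users": "s1"}
branch = {"orders": "s3", "users": "s4"}

print(conflicts(base, main, branch))
```

Here only `orders` conflicts: `users` changed on the branch alone, so it can be merged without manual resolution.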