Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

January 12, 2026 · 6 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and supported by an Alloy proof-of-concept, but lacks a production implementation and large-scale evaluation; labels, trust, and integration remain open engineering work.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 0/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/1

Reproducibility

Status: No open assets linked

Open source: Unknown

License: Creative Commons Attribution 4.0 (paper)

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, Christian Kästner


Why It Matters For Business

Deterministic guardrails reduce unacceptable risks in enterprise agents (data leaks, unauthorized writes) and let teams choose autonomy levels with verifiable safety.

Summary TLDR

LLM agents can compose tool calls in ways that leak data or perform unsafe actions. The paper proposes a practical process: use STPA (a systems safety method) to find hazards, convert requirements into formal, enforceable specifications for data flows and tool sequences, and extend the Model Context Protocol (MCP) so tools and data carry structured labels (confidentiality, capabilities, trust). A four-tier enforcement model (blocklist/allowlist/mustlist/confirmation) is applied at tool boundaries. An Alloy model shows these label-based policies can deterministically block unsafe traces while keeping safe behaviors. The work is a concrete blueprint and a proof-of-concept; an engine to enforce

Problem Statement

LLM agents that call external APIs and tools can accidentally or adversarially leak sensitive data or perform harmful actions. Existing model-based guards improve detection but do not provide guarantees. We need a practical way to (1) identify interaction hazards and (2) enforce deterministic constraints on tool ordering and data flows at run time, without excessive manual labeling or constant user prompts.

Main Contribution

Adapt STPA safety analysis to identify interaction hazards and derive agent safety requirements.

Turn safety requirements into symbolic specifications enforceable at tool boundaries (data-flow and temporal constraints).
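As a rough illustration of a temporal (tool-sequence) constraint checked at the tool boundary, the sketch below maps the four tiers named in the summary (blocklist, allowlist, mustlist, confirmation) onto a small decision function. The tier semantics, the `SequencePolicy` and `check_call` names, and the example tool names are assumptions made for illustration; the summary does not define them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CONFIRM = "confirm"


@dataclass
class SequencePolicy:
    # Assumed interpretations of the four tiers (not the paper's definitions).
    blocklist: set = field(default_factory=set)      # tools that may never run
    allowlist: set = field(default_factory=set)      # if non-empty, only these may run
    mustlist: dict = field(default_factory=dict)     # tool -> tool that must have run earlier
    confirmation: set = field(default_factory=set)   # tools that need user approval


def check_call(policy: SequencePolicy, history: list, tool: str) -> Decision:
    """Decide whether `tool` may run, given the tools already called in `history`."""
    if tool in policy.blocklist:
        return Decision.BLOCK
    if policy.allowlist and tool not in policy.allowlist:
        return Decision.BLOCK
    required = policy.mustlist.get(tool)
    if required is not None and required not in history:
        return Decision.BLOCK  # temporal rule: the prerequisite has not run yet
    if tool in policy.confirmation:
        return Decision.CONFIRM
    return Decision.ALLOW


# Hypothetical policy for a calendar agent.
policy = SequencePolicy(
    blocklist={"delete_calendar"},
    allowlist={"read_calendar", "redact", "send_email", "delete_calendar"},
    mustlist={"send_email": "redact"},
    confirmation={"send_email"},
)
```

Because the check depends only on the call history and the policy, the same inputs always give the same decision, which is the deterministic property the summary emphasizes.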

Key Findings

Within its analysis bounds, an Alloy model shows that label-based policies eliminate unsafe flows that otherwise occur.

Practical Use: Use a formal analyzer (Alloy or similar) to check your agent+tool spec; properly specified label policies can deterministically prevent leaks in evaluated traces.

Evidence Ref: Section 5: Preliminary Results (Alloy analysis)

The existing MCP provides only minimal, optional annotations, which are insufficient for deterministic enforcement.

Practical Use: Augment MCP with required structured labels before relying on it for security checks.

Evidence Ref: Section 1 and 4.3 (MCP limitations)
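One way to treat labels as required rather than optional is to reject any tool descriptor that lacks them before the tool is registered. The sketch below assumes a plain dictionary of metadata and a hypothetical three-field schema (confidentiality, capabilities, trust); the field values and the `parse_labels` helper are illustrative and are not part of MCP or of the paper's proposed extension.

```python
from dataclasses import dataclass

# Assumed label vocabularies, chosen only for this example.
CONFIDENTIALITY_LEVELS = ("public", "internal", "private")
TRUST_LEVELS = ("untrusted", "trusted")
REQUIRED_FIELDS = ("confidentiality", "capabilities", "trust")


@dataclass(frozen=True)
class ToolLabels:
    confidentiality: str
    capabilities: frozenset
    trust: str


def parse_labels(tool_name: str, metadata: dict) -> ToolLabels:
    """Return validated labels, or raise ValueError if any label is missing or invalid."""
    missing = [name for name in REQUIRED_FIELDS if name not in metadata]
    if missing:
        raise ValueError(f"{tool_name}: missing required labels {missing}")
    if metadata["confidentiality"] not in CONFIDENTIALITY_LEVELS:
        raise ValueError(f"{tool_name}: unknown confidentiality level")
    if metadata["trust"] not in TRUST_LEVELS:
        raise ValueError(f"{tool_name}: unknown trust level")
    if not isinstance(metadata["capabilities"], (list, tuple, set, frozenset)):
        raise ValueError(f"{tool_name}: capabilities must be a collection of strings")
    return ToolLabels(
        confidentiality=metadata["confidentiality"],
        capabilities=frozenset(metadata["capabilities"]),
        trust=metadata["trust"],
    )
```

Failing closed at registration time means a later policy check never has to guess what an unlabeled tool can do.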

Results

Metric: safety violation elimination
Value: verified in bounded Alloy model when policies enforced
Baseline: unsafe traces exist without policies
Delta: not reported
Split / Dataset: Alloy bounded execution traces (toy calendar agent model)
Evidence: Alloy analysis found counterexamples without policies and no violations when policies and sanitization steps enforced
Evidence Ref: Section 5
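To make the shape of this result concrete, the following sketch reproduces the idea of a bounded check in plain Python rather than Alloy: it enumerates every tool sequence up to a fixed length for a toy agent and looks for traces where private data reaches an external sink. The three tool names, the single-label state, and the blocking rule are assumptions invented for this example; they are not the paper's Alloy model.

```python
from itertools import product

# Hypothetical toy tool set for a calendar agent.
TOOLS = ("read_calendar", "sanitize", "send_external")


def run(trace, enforce_policy):
    """Execute a trace; return (leaked, delivered) for the external sink."""
    label = None      # label of the data the agent currently holds
    leaked = False    # True once private data reaches the external sink
    delivered = 0     # number of sends that carried sanitized data
    for tool in trace:
        if tool == "read_calendar":
            label = "private"
        elif tool == "sanitize":
            if label is not None:
                label = "public"
        elif tool == "send_external":
            if label == "private":
                if enforce_policy:
                    continue  # the interceptor blocks this call
                leaked = True
            elif label == "public":
                delivered += 1
    return leaked, delivered


def unsafe_traces(max_len, enforce_policy):
    """List every trace of length 1..max_len in which private data leaks."""
    return [
        trace
        for n in range(1, max_len + 1)
        for trace in product(TOOLS, repeat=n)
        if run(trace, enforce_policy)[0]
    ]
```

In this toy model, `unsafe_traces(3, enforce_policy=False)` contains counterexamples such as `("read_calendar", "send_external")`, while `unsafe_traces(3, enforce_policy=True)` is empty and the sanitized path still delivers. Like any bounded analysis, this says nothing about traces longer than the chosen bound.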

What To Try In 7 Days

Run a quick STPA-style hazard brainstorm on one task-specific agent (e.g., calendar or CRM flow).

Add three minimal MCP tags (confidentiality, capabilities, trust) to your tool registry for that agent.

Prototype an interceptor that blocks send_email when data is labeled private and target is external.
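The interceptor in the last step could start as small as the sketch below. It assumes data values carry a confidentiality label and that "external" means any address outside one internal domain; `Labeled`, `INTERNAL_DOMAIN`, `is_external`, and the stand-in `send_email` are hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a tool call would break a data-flow rule."""


@dataclass(frozen=True)
class Labeled:
    value: str
    confidentiality: str  # assumed values: "public" or "private"


INTERNAL_DOMAIN = "example.com"  # hypothetical internal mail domain


def is_external(address: str) -> bool:
    return not address.lower().endswith("@" + INTERNAL_DOMAIN)


def send_email(to: str, body: str) -> str:
    # Stand-in for the real tool; a real agent would call the mail service here.
    return f"sent to {to}"


def guarded_send_email(to: str, body: Labeled) -> str:
    """Block the call when private data is headed to an external recipient."""
    if body.confidentiality == "private" and is_external(to):
        raise PolicyViolation("private data may not be sent to an external address")
    return send_email(to, body.value)
```

Calling `guarded_send_email("bob@partner.org", Labeled("Q3 plan", "private"))` raises `PolicyViolation`, while the same body sent to an `@example.com` address goes through. The check is only as good as the label on `body`, which is why label reliability appears among the limitations.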

Agent Features

Planning: LLM plans tool calls each loop

Tool Use: intercepted tool calls at runtime; tool sequencing constraints (temporal rules); data-label based blocking for external writes

Frameworks: Model Context Protocol (MCP)

Is Agentic: Yes

Architectures: LLM-based agent

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Creative Commons Attribution 4.0 (paper)

Risks & Boundaries

Limitations

Requires reliable labels on data and tools; labeling can be costly, and labels can be spoofed.

MCP metadata may be untrusted in open markets; must limit to certified tools or add verification.
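The source does not say what form label verification should take. As one possible reading, the sketch below has a certifier sign a tool's label metadata with a shared secret and has the agent runtime check that signature before trusting the labels. The HMAC scheme, the key handling, and the function names are all assumptions; a real deployment might prefer public-key signatures issued by a certifying authority.

```python
import hashlib
import hmac
import json


def sign_labels(labels: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the labels."""
    payload = json.dumps(labels, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def labels_are_authentic(labels: dict, signature: str, key: bytes) -> bool:
    """Check the signature using a constant-time comparison."""
    expected = sign_labels(labels, key)
    return hmac.compare_digest(expected, signature)
```

If a tool later changes `trust` from `"untrusted"` to `"trusted"` in its own metadata, the signature no longer matches and the runtime can refuse to load it.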

When Not To Use

For general-purpose agents with unknown tools and broad scope (hard to specify safety rules).

In low-latency, high-throughput contexts where interception adds unacceptable delay.

Failure Modes

Incorrect or forged labels let unsafe flows bypass checks.

Overly strict policies block needed functionality and hurt adoption.