Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

Overview

Decision Snapshot: Needs Validation

The idea is practical and supported by an Alloy proof-of-concept, but lacks a production implementation and large-scale evaluation; labels, trust, and integration remain open engineering work.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 0/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/1

Reproducibility

Status: No open assets linked

Open source: Unknown

License: Creative Commons Attribution 4.0 (paper)

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, Christian Kästner


Why It Matters For Business

Deterministic guardrails reduce unacceptable risks in enterprise agents (data leaks, unauthorized writes) and let teams choose autonomy levels with verifiable safety.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

LLM agents can compose tool calls in ways that leak data or perform unsafe actions. The paper proposes a practical process: use STPA (a systems safety method) to find hazards, convert requirements into formal, enforceable specifications for data flows and tool sequences, and extend the Model Context Protocol (MCP) so tools and data carry structured labels (confidentiality, capabilities, trust). A four-tier enforcement model (blocklist/allowlist/mustlist/confirmation) is applied at tool boundaries. An Alloy model shows these label-based policies can deterministically block unsafe traces while keeping safe behaviors. The work is a concrete blueprint and a proof-of-concept; an engine to enforce

Problem Statement

LLM agents that call external APIs and tools can accidentally or adversarially leak sensitive data or perform harmful actions. Existing model-based guards improve detection but not guarantees. We need a practical way to (1) identify interaction hazards and (2) enforce deterministic constraints on tool ordering and data flows at run time, without excessive manual labeling or constant user prompts.

Main Contribution

Adapt STPA safety analysis to identify interaction hazards and derive agent safety requirements.

Turn safety requirements into symbolic specifications enforceable at tool boundaries (data-flow and temporal constraints).
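
A minimal sketch of what a specification enforceable at a tool boundary could look like, covering the four tiers (blocklist, allowlist, mustlist, confirmation) and one temporal constraint. The `Policy` fields, the `decide` function, the tool names, and the reading of "mustlist" as "these tools must have run earlier" are all assumptions made for illustration; the paper's own notation and tier semantics may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CONFIRM = "confirm"


@dataclass
class Policy:
    """Hypothetical encoding of the four tiers; not the paper's notation."""
    blocklist: set = field(default_factory=set)   # tools that are never permitted
    allowlist: set = field(default_factory=set)   # tools permitted without prompting
    mustlist: dict = field(default_factory=dict)  # tool -> tools that must run first


def decide(policy, history, tool):
    """Check one proposed tool call against the policy and the calls made so far."""
    if tool in policy.blocklist:
        return Decision.BLOCK
    # Temporal constraint: every prerequisite must already appear in the history.
    if not policy.mustlist.get(tool, set()).issubset(history):
        return Decision.BLOCK
    if tool in policy.allowlist:
        return Decision.ALLOW
    # Anything the policy does not cover falls through to user confirmation.
    return Decision.CONFIRM


# Illustrative policy for a toy calendar agent (tool names are made up).
policy = Policy(
    blocklist={"delete_calendar"},
    allowlist={"read_calendar", "sanitize", "send_email"},
    mustlist={"send_email": {"sanitize"}},
)
```

With this policy, `decide(policy, ["read_calendar"], "send_email")` is blocked because `sanitize` has not run yet, while the same call after `sanitize` is allowed. Because the check is a plain lookup rather than a model judgment, the same inputs always produce the same decision.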

Key Findings

A bounded Alloy model shows that, within the analyzed bounds, label-based policies eliminate unsafe flows that otherwise occur.

Practical Use: Use a formal analyzer (Alloy or similar) to check your agent+tool spec; properly specified label policies can deterministically prevent leaks in evaluated traces.

Evidence Ref: Section 5: Preliminary Results (Alloy analysis)
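
To illustrate what a bounded check does, the sketch below enumerates every tool sequence up to a fixed length and looks for unsafe ones, with and without a policy. This is plain brute-force enumeration in Python, not Alloy and not the paper's model; the three tool names, the taint rule, and the `enforce` policy are assumptions chosen to mirror the toy calendar scenario.

```python
from itertools import product

# Hypothetical tool alphabet for a toy calendar agent.
TOOLS = ["read_calendar", "sanitize", "send_email_external"]


def is_unsafe(trace):
    """Unsafe if calendar data reaches an external email with no sanitize in between."""
    tainted = False
    for tool in trace:
        if tool == "read_calendar":
            tainted = True
        elif tool == "sanitize":
            tainted = False
        elif tool == "send_email_external" and tainted:
            return True
    return False


def enforce(trace):
    """Assumed policy: drop external sends attempted while private data is in play."""
    executed, tainted = [], False
    for tool in trace:
        if tool == "send_email_external" and tainted:
            continue  # blocked at the tool boundary
        if tool == "read_calendar":
            tainted = True
        elif tool == "sanitize":
            tainted = False
        executed.append(tool)
    return executed


def unsafe_traces(max_len, with_policy):
    """Return every attempted trace up to max_len whose executed form is unsafe."""
    found = []
    for n in range(1, max_len + 1):
        for trace in product(TOOLS, repeat=n):
            run = enforce(trace) if with_policy else list(trace)
            if is_unsafe(run):
                found.append(trace)
    return found
```

Calling `unsafe_traces(3, with_policy=False)` returns counterexamples such as `("read_calendar", "send_email_external")`, and `unsafe_traces(3, with_policy=True)` returns an empty list. As with any bounded analysis, an empty result only covers traces up to the chosen length.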

The existing MCP provides only minimal, optional annotations, which are insufficient for deterministic enforcement.

Practical Use: Augment MCP with required structured labels before relying on it for security checks.

Evidence Ref: Section 1 and 4.3 (MCP limitations)
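
As a rough sketch of "required structured labels", a tool registry could refuse to treat an entry as checkable unless all three label kinds are present. The `labels` field, the label values, and the `missing_or_invalid_labels` helper are hypothetical; they are not part of MCP or of the paper's proposed extension.

```python
# The three label kinds named in the summary; the dict layout is an assumption.
REQUIRED_LABELS = ("confidentiality", "capabilities", "trust")
CONFIDENTIALITY_LEVELS = ("public", "internal", "private")  # assumed value set


def missing_or_invalid_labels(entry):
    """Return a list of problems; an empty list means the entry can be policy-checked."""
    labels = entry.get("labels", {})
    problems = [f"missing label: {name}" for name in REQUIRED_LABELS if name not in labels]
    level = labels.get("confidentiality")
    if level is not None and level not in CONFIDENTIALITY_LEVELS:
        problems.append(f"unknown confidentiality level: {level}")
    return problems


# Illustrative registry entries (names and values are made up).
send_email = {
    "name": "send_email",
    "labels": {
        "confidentiality": "private",
        "capabilities": ["external_write"],
        "trust": "certified",
    },
}
unlabeled = {"name": "web_search"}
```

Here `missing_or_invalid_labels(send_email)` is empty, while the unlabeled entry reports all three labels as missing. Treating a missing label as an error rather than a default is what separates required labels from optional annotations.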

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
safety violation elimination	verified in bounded Alloy model when policies enforced	unsafe traces exist without policies	not reported	Alloy bounded execution traces (toy calendar agent model)	Alloy analysis found counterexamples without policies and no violations when policies and sanitization steps were enforced	Section 5

What To Try In 7 Days

Run a quick STPA-style hazard brainstorm on one task-specific agent (e.g., calendar or CRM flow).

Add three minimal MCP tags (confidentiality, capabilities, trust) to your tool registry for that agent.

Prototype an interceptor that blocks send_email when data is labeled private and target is external.
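
The interceptor from the last step could start out as small as the sketch below. Everything here is an assumption for illustration: the `Labeled` wrapper, the `call_tool` entry point, the `example.com` internal domain, and the rule that "external" means any address outside that domain. A real deployment would hook into whatever dispatch layer the agent framework already uses.

```python
from dataclasses import dataclass

INTERNAL_DOMAIN = "example.com"  # hypothetical; replace with your organization's domain


@dataclass(frozen=True)
class Labeled:
    """A value carrying a confidentiality label ("public" or "private")."""
    value: str
    confidentiality: str


class PolicyViolation(Exception):
    """Raised when a tool call is blocked by the interceptor."""


def is_external(address):
    return not address.lower().endswith("@" + INTERNAL_DOMAIN)


def call_tool(tools, name, **args):
    """Check the call against the policy before dispatching to the real tool."""
    if name == "send_email":
        body = args["body"]
        if body.confidentiality == "private" and is_external(args["to"]):
            raise PolicyViolation("private data may not be sent to " + args["to"])
    return tools[name](**args)


# Stand-in tool that records what would have been sent.
sent = []
tools = {"send_email": lambda to, body: sent.append((to, body.value))}
```

Sending a `Labeled(..., "private")` body to an `@example.com` address goes through, while the same body addressed to another domain raises `PolicyViolation` before the tool runs. Public data is unaffected in both cases.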

Agent Features

Planning

LLM plans tool calls each loop

Tool Use

intercepted tool calls at runtime

tool sequencing constraints (temporal rules)

data-label based blocking for external writes

Frameworks

Model Context Protocol (MCP)

Is Agentic

Yes

Architectures

LLM-based agent

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Creative Commons Attribution 4.0 (paper)

Risks & Boundaries

Limitations

Requires reliable labels on data and tools; labeling can be costly or spoofable.

MCP metadata may be untrusted in open markets; deployments must either restrict agents to certified tools or add verification.

When Not To Use

For general-purpose agents with unknown tools and broad scope (hard to specify safety rules).

In low-latency, high-throughput contexts where interception adds unacceptable delay.

Failure Modes

Incorrect or forged labels let unsafe flows bypass checks.

Overly strict policies block needed functionality and hurt adoption.

