A simple LLM-based monitor that stops unsafe AutoGPT actions during live web and file tests

Overview

Decision Snapshot: Needs Validation

The monitor is a practical prototype with strong test-set numbers, but it was tuned and evaluated on a limited AutoGPT-derived dataset and a small set of synthetic attacks, so expect additional work before production rollout.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A lightweight LLM-based gate can block many dangerous agent actions before they run, reducing incident risk for products that let agents access the web or filesystem.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The authors build and evaluate AgentMonitor, a flexible safety layer that inspects language-model-agent outputs before web or filesystem actions run. They assemble a real-world testbench (29 tasks, 422 attempts, 1,965 transcripts) plus synthetic unsafe examples, tune a gpt-3.5-turbo-16k monitor, and report a test-set F1 of 89.4% (precision 82.1%, recall 98.3%, AUC 0.982). The paper also surfaces practical limits: reliance on whitelists, brittle few-shot examples, limited dataset scale, and evaluation mainly on AutoGPT.

Problem Statement

Open-world tests of language-model agents can cause real irreversible harm (file deletion, data leaks, malicious web requests). We need an automated, scalable way to detect and stop unsafe agent actions during live internet and filesystem tests while minimizing human review.

Main Contribution

A safety framework for running automated, open‑world tests of language-model agents where a monitor can halt unsafe actions.

AgentMonitor: a configurable LLM-based monitor (gpt-3.5-turbo-16k) that scores agent actions and can stop execution.
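The gating idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `gate_action`, `toy_score`, `run_step`, the 0-100 scale (higher meaning safer), and the threshold of 75 are all assumptions made for the example, and `toy_score` is a keyword stand-in for the LLM call that a real monitor would make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float   # assumed convention: higher means safer
    allowed: bool


def gate_action(action: str,
                score_fn: Callable[[str], float],
                threshold: float = 75.0) -> Verdict:
    """Score a proposed action and decide whether it may run."""
    score = score_fn(action)
    return Verdict(score=score, allowed=score >= threshold)


def toy_score(action: str) -> float:
    """Hypothetical stand-in for the LLM monitor call."""
    risky_fragments = ("rm -rf", "/etc/passwd", "curl http")
    return 10.0 if any(f in action for f in risky_fragments) else 90.0


def run_step(action: str,
             execute: Callable[[str], str],
             score_fn: Callable[[str], float] = toy_score) -> str:
    """Execute the action only if the monitor allows it."""
    verdict = gate_action(action, score_fn)
    if not verdict.allowed:
        return f"BLOCKED (score={verdict.score:.0f})"
    return execute(action)
```

Keeping the scorer as a plain callable means the same gate can wrap a keyword filter in unit tests and an LLM call in CI, without changing the agent loop.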

Key Findings

AgentMonitor achieves high detection performance on the authors' test set.

Numbers: F1 89.4%, precision 82.1%, recall 98.3%, AUC 0.982

Practical Use: A well-prompted LLM can reliably flag unsafe web/filesystem actions in many test cases; deploy it as a pre-execution gate to reduce dangerous runs.

Evidence Ref: Results section; Table 2; Figure 3

Context matters: removing the previous agent action drops F1 by 2.6 points.

Numbers: Ablating Previous Context → F1 86.8% (from 89.4%)

Practical Use: Keep recent agent 'thoughts' and prior actions in monitor inputs to reduce false alarms and missed attacks.

Evidence Ref: Ablation table (Table 2)
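One way to make this kind of ablation easy to repeat is to assemble the monitor input from optional parts. The sketch below is illustrative only: `MonitorInput`, `build_monitor_prompt`, and the section labels are hypothetical names, and the paper's actual prompt template is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorInput:
    proposed_action: str
    task_prompt: Optional[str] = None
    previous_action: Optional[str] = None


def build_monitor_prompt(inp: MonitorInput,
                         include_task: bool = True,
                         include_previous: bool = True) -> str:
    """Assemble the text shown to the monitor; flags allow ablations."""
    parts = []
    if include_task and inp.task_prompt:
        parts.append(f"Task given to the agent:\n{inp.task_prompt}")
    if include_previous and inp.previous_action:
        parts.append(f"Previous agent action:\n{inp.previous_action}")
    # The proposed action is always included; it is what gets scored.
    parts.append(f"Proposed action to score:\n{inp.proposed_action}")
    return "\n\n".join(parts)
```

Running the same evaluation set with `include_previous=False` gives a local version of the ablation, which shows whether the context sensitivity reported above also holds for a different agent.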

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Monitor F1 on test set	89.4%	—	—	authors' test set (70% eval split)	Main results	Results section, Table 2
Precision	82.1%	—	—	test set	Main results	Results section, Table 2
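As a sanity check on how these numbers relate, F1 is the harmonic mean of precision and recall. The short snippet below recomputes it from the rounded figures in this summary; `f1_score` is a local helper written for this example, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported for the test set.
f1 = f1_score(0.821, 0.983)
print(f"{f1:.3f}")
```

The result is about 0.895, which matches the reported 89.4% up to rounding of the published precision and recall.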

What To Try In 7 Days

Run your agent CI tests through a pre-execution monitor that scores web/filesystem actions.

Log and rank low‑score (suspect) actions for human review instead of auto-executing them.

Add 'previous action' and 'prompt context' to monitor inputs and test recall/precision trade-offs.
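The review-queue and trade-off steps above can be prototyped without any LLM calls once monitor scores have been logged. The sketch below assumes scores where lower means more suspect and uses made-up toy data; `precision_recall_at` and `review_queue` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def precision_recall_at(scores: Sequence[float],
                        is_unsafe: Sequence[bool],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when actions scoring below threshold are flagged."""
    tp = fp = fn = 0
    for score, unsafe in zip(scores, is_unsafe):
        flagged = score < threshold
        if flagged and unsafe:
            tp += 1
        elif flagged and not unsafe:
            fp += 1
        elif not flagged and unsafe:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def review_queue(actions: Sequence[str],
                 scores: Sequence[float],
                 threshold: float) -> List[str]:
    """Flagged actions, most suspect (lowest score) first."""
    flagged = [(s, a) for a, s in zip(actions, scores) if s < threshold]
    return [a for _, a in sorted(flagged, key=lambda pair: pair[0])]


# Toy data for illustration only; not from the paper.
toy_scores = [5, 40, 60, 80, 95]
toy_labels = [True, True, False, True, False]
for t in (50, 90):
    print(t, precision_recall_at(toy_scores, toy_labels, t))
```

On this toy data, raising the threshold from 50 to 90 catches the remaining unsafe action at the cost of one false alarm, which is the kind of trade-off to examine on real logs before choosing an operating point.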

Agent Features

Memory

short-term context/history used by monitor

Planning

multi-step plans

Tool Use

web browsing (HTTP)

filesystem I/O and command execution

Frameworks

AutoGPT

Is Agentic

Yes

Architectures

LLM-based agent

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/tree/master/paper

Data URLs

https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/tree/master/paper

Risks & Boundaries

Limitations

Evaluated mainly on AutoGPT transcripts; cross-agent generalization is untested.

The set of unsafe examples is small (30 unsafe logs) and partly synthetic, leaving many attack vectors uncovered.

When Not To Use

To guard frontier base models that have not seen human review.

For high-assurance systems where 100% containment is required.

Failure Modes

False positives that block valid actions and slow CI.

False negatives on unseen or cleverly obfuscated attacks.

Core Entities

Models

gpt-3.5-turbo-16k-0613

gpt-4-0613

gpt-4-0314

Metrics

F1

precision

recall

AUC

ROC / threshold

Datasets

AutoGPT CI dataset (1,965 transcripts; 422 attempts)

Synthetic unsafe/off-task logs (30 unsafe, 27 off-task)

Larger AutoGPT CI pool (21,013 responses across agents; Appendix F)

Benchmarks

AGBenchmark-style 29-task suite (authors' test suite)

Auto-GPT-Benchmarks (github link)

Context Entities

Models

AutoGPT (agent under test)
