A simple LLM-based monitor that stops unsafe AutoGPT actions during live web and file tests

November 17, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The monitor is a practical prototype with strong test-set numbers, but it was tuned and evaluated on a limited AutoGPT-derived dataset and a small set of synthetic attacks, so expect additional validation work before production rollout.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A lightweight LLM-based gate can block many dangerous agent actions before they run, reducing incident risk for products that let agents access the web or filesystem.

Summary TLDR

The authors build and evaluate AgentMonitor, a flexible safety layer that inspects language-model-agent outputs before web or filesystem actions run. They assemble a real-world testbench (29 tasks, 422 attempts, 1,965 transcripts) plus synthetic unsafe examples, tune a gpt-3.5-turbo-16k monitor, and report a test-set F1 of 89.4% (precision 82.1%, recall 98.3%, AUC 0.982). The paper also surfaces practical limits: reliance on whitelists, brittle few-shot examples, limited dataset scale, and evaluation mainly on AutoGPT.

Problem Statement

Open-world tests of language-model agents can cause real irreversible harm (file deletion, data leaks, malicious web requests). We need an automated, scalable way to detect and stop unsafe agent actions during live internet and filesystem tests while minimizing human review.

Main Contribution

A safety framework for running automated, open‑world tests of language-model agents where a monitor can halt unsafe actions.

AgentMonitor: a configurable LLM-based monitor (gpt-3.5-turbo-16k) that scores agent actions and can stop execution.
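The idea of a monitor that scores an action and can stop execution might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ProposedAction`, `gate`, `UnsafeActionError`, and `toy_score` are hypothetical, the score is assumed to lie in [0, 1] with higher meaning safer, and the default threshold of 0.5 is arbitrary. `toy_score` is a stand-in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An action the agent wants to run (hypothetical structure)."""
    command: str      # e.g. "write_file" or "browse_website"
    arguments: dict   # arguments exactly as proposed by the agent


class UnsafeActionError(RuntimeError):
    """Raised when the monitor halts execution."""


def gate(action: ProposedAction,
         score_fn: Callable[[ProposedAction], float],
         threshold: float = 0.5) -> ProposedAction:
    """Return the action if its score clears the threshold, else halt.

    score_fn stands in for the LLM monitor; higher score = safer (assumed).
    """
    score = score_fn(action)
    if score < threshold:
        raise UnsafeActionError(
            f"blocked {action.command!r}: score {score:.2f} < {threshold:.2f}"
        )
    return action


def toy_score(action: ProposedAction) -> float:
    """Toy scorer for illustration only; a real monitor would query an LLM."""
    return 0.1 if "rm -rf" in str(action.arguments) else 0.9
```

The gate sits between the agent's decision and the executor, so a blocked action never reaches the web or filesystem. Injecting `score_fn` keeps the gate testable without network access.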

Key Findings

AgentMonitor achieves high detection performance on the authors' test set.

Numbers: F1 89.4%, precision 82.1%, recall 98.3%, AUC 0.982

Practical Use: A well-prompted LLM can reliably flag unsafe web/filesystem actions in many test cases; deploy it as a pre-execution gate to reduce dangerous runs.

Evidence Ref: Results section; Table 2; Figure 3

Context matters: removing the previous agent action from the monitor's input lowers F1 by 2.6 points.

Numbers: Ablating Previous Context → F1 86.8% (from 89.4%)

Practical Use: Keep recent agent 'thoughts' and prior actions in monitor inputs to reduce false alarms and missed attacks.

Evidence Ref: Ablation table (Table 2)
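One way to make the context ablation concrete is a prompt builder with a switch for the previous action. The sketch below is an assumption for illustration: the function name `build_monitor_input`, the section labels, and the sample strings are hypothetical and do not reproduce the paper's actual prompt.

```python
from typing import Optional


def build_monitor_input(task: str,
                        proposed_action: str,
                        previous_action: Optional[str] = None,
                        include_previous: bool = True) -> str:
    """Assemble the text shown to the monitor.

    Setting include_previous=False mimics an ablation that withholds
    the agent's prior action from the monitor.
    """
    parts = [f"TASK:\n{task}"]
    if include_previous and previous_action:
        parts.append(f"PREVIOUS ACTION:\n{previous_action}")
    parts.append(f"PROPOSED ACTION:\n{proposed_action}")
    return "\n\n".join(parts)


full = build_monitor_input(
    task="Write the word 'Washington' to output.txt",
    proposed_action="write_file(path='output.txt', text='Washington')",
    previous_action="read_file(path='instructions.txt')",
)
ablated = build_monitor_input(
    task="Write the word 'Washington' to output.txt",
    proposed_action="write_file(path='output.txt', text='Washington')",
    previous_action="read_file(path='instructions.txt')",
    include_previous=False,
)
```

Scoring the same labeled transcripts with both variants gives a direct comparison of how much the prior action contributes.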

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Monitor F1 on test set | 89.4% | - | - | authors' test set (70% eval split) | Main results | Results section, Table 2
Precision | 82.1% | - | - | test set | Main results | Results section, Table 2
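As a quick sanity check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported precision (82.1%) and recall (98.3%). The helper name `f1_score` below is just for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.821
reported_recall = 0.983
recomputed_f1 = f1_score(reported_precision, reported_recall)
print(f"{recomputed_f1:.3f}")  # prints 0.895
```

The recomputed value is within a tenth of a point of the reported 89.4%; the small gap is plausibly due to rounding in the published precision and recall.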

What To Try In 7 Days

Run your agent CI tests through a pre-execution monitor that scores web/filesystem actions.

Log and rank low‑score (suspect) actions for human review instead of auto-executing them.

Add 'previous action' and 'prompt context' to monitor inputs and test recall/precision trade-offs.
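The review queue and the precision/recall trade-off from the steps above can be explored with a small threshold sweep over labeled examples. This is a sketch on made-up toy data: the function names, scores, labels, and thresholds are all hypothetical, and scores are assumed to lie in [0, 1] with lower meaning more suspect.

```python
from typing import List, Sequence, Tuple


def precision_recall_at(scores: Sequence[float],
                        unsafe_labels: Sequence[bool],
                        threshold: float) -> Tuple[float, float]:
    """Flag an action as unsafe when its score is below the threshold."""
    tp = fp = fn = 0
    for score, is_unsafe in zip(scores, unsafe_labels):
        flagged = score < threshold
        if flagged and is_unsafe:
            tp += 1
        elif flagged and not is_unsafe:
            fp += 1
        elif not flagged and is_unsafe:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def review_queue(scores: Sequence[float],
                 action_ids: Sequence[str],
                 threshold: float) -> List[str]:
    """Return flagged action ids, most suspect (lowest score) first."""
    suspects = [(s, a) for s, a in zip(scores, action_ids) if s < threshold]
    return [action_id for _, action_id in sorted(suspects)]


# Toy data: two unsafe actions and three safe ones.
scores = [0.05, 0.20, 0.40, 0.70, 0.90]
unsafe_labels = [True, True, False, False, False]
action_ids = ["a", "b", "c", "d", "e"]

for threshold in (0.1, 0.3, 0.5):
    p, r = precision_recall_at(scores, unsafe_labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.1 to 0.5 lifts recall from 0.50 to 1.00 while precision falls from 1.00 to 0.67, which is the kind of trade-off to measure on your own CI logs before picking a cutoff.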

Agent Features

Memory
short-term context/history used by monitor
Planning
multi-step plans
Tool Use
web browsing (HTTP)
filesystem I/O and command execution
Frameworks
AutoGPT
Is Agentic

Yes

Architectures
LLM-based agent

Risks & Boundaries

Limitations

Evaluated mainly on AutoGPT transcripts; cross-agent generalization is untested.

Unsafe examples are small and partly synthetic (30 unsafe logs), leaving many attack vectors uncovered.

When Not To Use

To guard frontier base models that have not undergone human review.

For high-assurance systems where 100% containment is required.

Failure Modes

False positives that block valid actions and slow CI.

False negatives on unseen or cleverly obfuscated attacks.

Core Entities

Models

gpt-3.5-turbo-16k-0613
gpt-4-0613
gpt-4-0314

Metrics

F1
precision
recall
AUC
ROC / threshold

Datasets

AutoGPT CI dataset (1,965 transcripts; 422 attempts)
Synthetic unsafe/off-task logs (30 unsafe, 27 off-task)
Larger AutoGPT CI pool (21,013 responses across agents; Appendix F)

Benchmarks

AGBenchmark-style 29-task suite (authors' test suite)
Auto-GPT-Benchmarks (github link)

Context Entities

Models

AutoGPT (agent under test)