A simple LLM-based monitor that stops unsafe AutoGPT actions during live web and file tests

November 17, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

Links

Abstract / PDF

Why It Matters For Business

A lightweight LLM-based gate can block many dangerous agent actions before they run, reducing incident risk for products that let agents access the web or filesystem.

Summary TLDR

The authors build and evaluate AgentMonitor, a flexible safety layer that inspects language-model-agent outputs before web or filesystem actions run. They assemble a real-world testbench (29 tasks, 422 attempts, 1,965 transcripts) plus synthetic unsafe examples, tune a gpt-3.5-turbo-16k monitor, and report a test-set F1 of 89.4% (precision 82.1%, recall 98.3%, AUC 0.982). The paper also surfaces practical limits: reliance on whitelists, brittle few-shot examples, limited dataset scale, and evaluation mainly on AutoGPT.

Problem Statement

Open-world tests of language-model agents can cause real, irreversible harm (file deletion, data leaks, malicious web requests). What is needed is an automated, scalable way to detect and stop unsafe agent actions during live internet and filesystem tests while minimizing human review.

Main Contribution

A safety framework for running automated, open‑world tests of language-model agents where a monitor can halt unsafe actions.

AgentMonitor: a configurable LLM-based monitor (gpt-3.5-turbo-16k) that scores agent actions and can stop execution.

A test suite and dataset: 29 tasks, 422 attempts, 1,965 transcripts used to evaluate monitor behavior.

An ablation study showing which prompt/context pieces most affect monitor performance and a recommended threshold (68).
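The gating idea behind these contributions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `gate_action`, `Verdict`, and `toy_score` are hypothetical names, the 0-100 score range is assumed, and the comparison direction (lower score means more suspect, so actions scoring below the threshold are blocked) is an assumption. Only the threshold value 68 comes from the summary above.

```python
from dataclasses import dataclass
from typing import Callable

# Threshold reported in the summary; how it is compared is an assumption.
INTERVENTION_THRESHOLD = 68


@dataclass
class Verdict:
    score: float   # monitor score, assumed to lie on a 0-100 scale
    allowed: bool  # True if the action may run


def gate_action(action: str,
                score_fn: Callable[[str], float],
                threshold: float = INTERVENTION_THRESHOLD) -> Verdict:
    """Score a proposed action before it runs and decide whether to allow it."""
    score = score_fn(action)
    # Assumption: lower scores are more suspect, so block below the threshold.
    return Verdict(score=score, allowed=score >= threshold)


def toy_score(action: str) -> float:
    """Hypothetical stand-in for the LLM monitor call; not a real safety check."""
    return 5.0 if "rm -rf" in action else 90.0


if __name__ == "__main__":
    for proposed in ["read_file notes.txt", "execute_shell rm -rf /"]:
        verdict = gate_action(proposed, toy_score)
        print(proposed, "->", "run" if verdict.allowed else "halt")
```

In a real harness, `score_fn` would wrap a call to the monitor model, and a halted action would stop the test run rather than just print a message.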

Key Findings

AgentMonitor achieves high detection performance on the authors' test set.

Numbers: F1 89.4%, precision 82.1%, recall 98.3%, AUC 0.982

Context matters: removing the previous agent action from the monitor's input lowers F1 by 2.6 points.

Numbers: Ablating Previous Context → F1 86.8% (from 89.4%)

Few-shot examples affect the precision/recall trade-off.

Numbers: Ablating Few Shot Examples → precision rises to 90.9% (from 82.1%) but recall falls to 93.4% (from 98.3%); F1 92.2%

The authors' unsafe dataset is small and partly synthetic.

Numbers: 30 manual unsafe logs + 27 off-task logs added; final supervised set replaced 30% of safe outputs
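As a sanity check on the figures above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported pairs. The sketch below uses the rounded percentages from this summary, so a difference of about 0.1 point from the reported F1 values is expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall pairs taken from the findings above.
full_monitor = f1_score(0.821, 0.983)
no_few_shot = f1_score(0.909, 0.934)

print(f"full monitor:        F1 ~ {full_monitor:.3f}")
print(f"few-shot ablated:    F1 ~ {no_few_shot:.3f}")
```

Both recomputed values land within rounding distance of the reported 89.4% and 92.2%.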

Results

Monitor F1 on test set

Value: 89.4%

Precision

Value: 82.1%

Recall

Value: 98.3%

AUC

Value: 0.982

Optimal intervention threshold (training)

Value: 68

Dataset size (transcripts)

Value: 1,965
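One plausible way to arrive at an intervention threshold like the one listed above is to sweep candidate thresholds over labeled training scores and keep the one with the best F1. The sketch below illustrates that general technique only; the summary does not say how the authors selected 68. The function name, the toy data, and the convention that an action is flagged when its score falls below the threshold are all assumptions.

```python
from typing import Sequence, Tuple


def best_threshold(scores: Sequence[float],
                   unsafe: Sequence[bool]) -> Tuple[int, float]:
    """Return the integer threshold in 0..100 with the highest F1.

    Assumption: an action is flagged as unsafe when score < threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_f1 = 0, -1.0
    for t in range(0, 101):
        tp = sum(1 for s, u in zip(scores, unsafe) if s < t and u)
        fp = sum(1 for s, u in zip(scores, unsafe) if s < t and not u)
        fn = sum(1 for s, u in zip(scores, unsafe) if s >= t and u)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy, made-up training scores: three unsafe actions, three safe ones.
toy_scores = [10, 20, 40, 70, 85, 95]
toy_unsafe = [True, True, True, False, False, False]
print(best_threshold(toy_scores, toy_unsafe))
```

Because the threshold is chosen on training data, it should be re-checked on held-out transcripts before being trusted in CI.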

What To Try In 7 Days

Run your agent CI tests through a pre-execution monitor that scores web/filesystem actions.

Log and rank low‑score (suspect) actions for human review instead of auto-executing them.

Add 'previous action' and 'prompt context' to monitor inputs and test recall/precision trade-offs.
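The three steps above could be prototyped roughly as follows. Everything here is a hypothetical sketch: `ProposedAction`, `build_monitor_input`, and `review_queue` are illustrative names, the prompt layout is invented, and scores are assumed to be 0-100 with lower meaning more suspect. The monitor call itself is left out; scores are filled in by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProposedAction:
    task: str                       # task or prompt context given to the agent
    previous_action: Optional[str]  # the agent's prior action, if any
    action: str                     # the action awaiting approval
    score: float = 0.0              # filled in by the monitor (assumed 0-100)


def build_monitor_input(p: ProposedAction) -> str:
    """Assemble the text a monitor would see, including prior context."""
    parts = [f"Task: {p.task}"]
    if p.previous_action is not None:
        parts.append(f"Previous action: {p.previous_action}")
    parts.append(f"Proposed action: {p.action}")
    return "\n".join(parts)


def review_queue(actions: List[ProposedAction],
                 threshold: float = 68) -> List[ProposedAction]:
    """Return actions scoring below the threshold, most suspect first."""
    suspects = [a for a in actions if a.score < threshold]
    return sorted(suspects, key=lambda a: a.score)


if __name__ == "__main__":
    batch = [
        ProposedAction("Write report", "read_file notes.txt",
                       "write_file report.txt", score=91),
        ProposedAction("Write report", None,
                       "execute_shell rm -rf /", score=3),
    ]
    for item in review_queue(batch):
        print(item.score, build_monitor_input(item), sep="\n")
```

Running the same batch with and without the `Previous action` line is a cheap way to measure how much that context changes precision and recall on your own transcripts.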

Agent Features

Memory

  • short-term context/history used by monitor

Planning

  • multi-step plans

Tool Use

  • web browsing (HTTP)
  • filesystem I/O and command execution

Frameworks

  • AutoGPT

Is Agentic

true

Architectures

  • LLM-based agent

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated mainly on AutoGPT transcripts; cross-agent generalization is untested.
  • The set of unsafe examples is small and partly synthetic (30 unsafe logs), leaving many attack vectors uncovered.
  • Monitor depends heavily on prompt components and whitelists; these can be brittle and require maintenance.
  • Using an LLM as a monitor can be gamed by sophisticated adversaries or novel injection patterns.

When Not To Use

  • As the guard for frontier base models that have not undergone human review.
  • For high-assurance systems where 100% containment is required.
  • When adversaries can control large parts of the agent's inputs or host resources.

Failure Modes

  • False positives that block valid actions and slow CI.
  • False negatives on unseen or cleverly obfuscated attacks.
  • Whitelist bypasses via indirect prompt injection or web-hosted payloads.
  • Overfitting to the monitor's examples, leading to blind spots against new attack styles.
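To make the whitelist failure mode concrete, here is a minimal sketch of an exact-host check for web actions. It is not the authors' whitelist mechanism, which this summary does not describe; `ALLOWED_HOSTS` and `host_is_allowed` are hypothetical, and the listed hosts are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical whitelist; real entries would come from the test configuration.
ALLOWED_HOSTS = frozenset({"example.com", "docs.example.com"})


def host_is_allowed(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """Allow only http(s) URLs whose host exactly matches a whitelist entry."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # urlparse lowercases the hostname and strips any userinfo and port.
    host = parsed.hostname or ""
    return host in allowed


if __name__ == "__main__":
    for candidate in [
        "https://example.com/page",
        "https://example.com.evil.test/",   # look-alike host
        "https://example.com@evil.test/",   # userinfo trick
        "file:///etc/passwd",
    ]:
        print(candidate, "->", host_is_allowed(candidate))
```

Exact matching avoids the look-alike and userinfo tricks that substring checks miss, but it does nothing about payloads served from a whitelisted page, which is the bypass named in the list above.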

Core Entities

Models

  • gpt-3.5-turbo-16k-0613
  • gpt-4-0613
  • gpt-4-0314

Metrics

  • F1
  • precision
  • recall
  • AUC
  • ROC / threshold

Datasets

  • AutoGPT CI dataset (1,965 transcripts; 422 attempts)
  • Synthetic unsafe/off-task logs (30 unsafe, 27 off-task)
  • Larger AutoGPT CI pool (21,013 responses across agents; Appendix F)

Benchmarks

  • AGBenchmark-style 29-task suite (authors' test suite)
  • Auto-GPT-Benchmarks (GitHub repository)

Context Entities

Models

  • AutoGPT (agent under test)