GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

Overview

Decision Snapshot: Needs Validation

Clear experimental evidence shows GPT-4 can carry out many web exploits in a sandbox; results are limited by withheld prompts/docs and a modest real-site sample.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 20%

Novelty: 60%

Authors

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang

Links

Abstract / PDF

Why It Matters For Business

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Who Should Care

CTO, CEO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The authors wrap LLMs in an agent that reads docs, calls functions (browser, terminal, Python), and plans multi-step attacks. GPT-4 (with document reading and a detailed prompt) autonomously exploited 11 of 15 sandboxed vulnerabilities (pass@5 73.3%, overall success rate 42.7%). GPT-3.5 barely succeeds (6.7% pass@5). Every open-source model tested scored 0%. Ablations show that document access and a strong system prompt are critical. A small real-world scan found one XSS vulnerability. The authors withhold the exact prompts, documents, and most code for safety.

Problem Statement

Can modern LLM agents autonomously discover and exploit website vulnerabilities without human feedback? The paper tests whether tool-capable LLMs can plan multi-step hacks, use website feedback, and scale to real sites while measuring success rates and costs.

Main Contribution

Demonstrates an LLM-agent pipeline that autonomously performs web attacks using function calls, document reading, and planning.

Benchmarks 10 LLMs on 15 sandboxed web vulnerabilities and reports pass@5 and overall success rates.

Key Findings

GPT-4 agent succeeded on most sandboxed vulnerabilities

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)

Practical Use: Frontier LLMs can automate multi-step web exploits; security teams should assume such automation is feasible and test defenses accordingly.

Evidence Ref: Table 2; Section 4.2

Open-source LLMs failed on the benchmark

Numbers: 0% pass@5 for all tested open-source models (Table 2)

Practical Use: Current open-source models lack reliable tool use and multi-step planning for autonomous hacks, but targeted tuning could close this gap.

Evidence Ref: Table 2; Section 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 pass@5	73.3%	—	—	15 sandbox vulnerabilities, 5 trials each	GPT-4 assistant achieved Pass@5 73.3% across the benchmark	Table 2; Section 4.2
GPT-4 overall success rate	42.7%	—	—	15 sandbox vulnerabilities, 5 trials each	Overall success rate (share of all individual trials that succeeded) 42.7%	Table 2; Section 4.2
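
The two metrics in the table measure different things, and a short sketch may make the distinction concrete. Pass@5 counts a vulnerability as solved if any of its 5 trials succeeds, while the overall success rate is the share of all individual trials that succeed. The per-vulnerability outcomes below are hypothetical: they are chosen only so that the aggregates match the reported figures (11 of 15 vulnerabilities solved, and 32 of 75 trials, the only whole-number count consistent with 42.7% of 15 x 5 trials). The names `pass_at_k`, `overall_success_rate`, and `trials` are illustrative, not taken from the paper.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of vulnerabilities with at least one successful trial."""
    solved = sum(1 for outcomes in trials.values() if any(outcomes))
    return solved / len(trials)


def overall_success_rate(trials: Dict[str, List[bool]]) -> float:
    """Fraction of all individual trials that succeeded."""
    total = sum(len(outcomes) for outcomes in trials.values())
    successes = sum(sum(outcomes) for outcomes in trials.values())
    return successes / total


# Hypothetical outcome matrix (NOT the paper's per-vulnerability data):
# 10 vulnerabilities with 3/5 successes, 1 with 2/5, and 4 with 0/5.
trials: Dict[str, List[bool]] = {}
for i in range(10):
    trials[f"vuln_{i}"] = [True, True, True, False, False]
trials["vuln_10"] = [True, True, False, False, False]
for i in range(11, 15):
    trials[f"vuln_{i}"] = [False] * 5

print(f"pass@5: {pass_at_k(trials):.1%}")
print(f"overall success rate: {overall_success_rate(trials):.1%}")
```

Because pass@5 only needs one success per vulnerability, it is always at least as high as the overall success rate on the same data.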

What To Try In 7 Days

Audit and restrict API function-calling endpoints and permissions.

Limit models' access to raw web documents or unfiltered internet content.

Instrument and log multi-step browser automation calls and long action sequences for anomaly detection of multi-call exploits (>=10 calls).
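
As a rough illustration of the first and third items above, the sketch below gates tool calls against an allowlist and raises an alert once a session reaches 10 logged calls. It is a minimal sketch under stated assumptions, not a production control: the class `ToolCallMonitor`, the tool names, and the session IDs are all hypothetical, and a real deployment would hook into whatever function-calling layer is actually in use.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Set


@dataclass
class ToolCallMonitor:
    """Hypothetical helper: allowlist gate plus per-session call counting."""

    allowed: Set[str]
    threshold: int = 10  # mirrors the ">=10 calls" heuristic above
    calls: DefaultDict[str, List[str]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record(self, session_id: str, tool: str) -> List[str]:
        """Log one tool call and return any alerts it triggers."""
        if tool not in self.allowed:
            # Disallowed tools are rejected and not counted as executed calls.
            return [f"blocked: '{tool}' is not on the allowlist"]
        self.calls[session_id].append(tool)
        if len(self.calls[session_id]) == self.threshold:
            # Alert once, at the moment the session reaches the threshold.
            return [f"alert: session '{session_id}' reached {self.threshold} tool calls"]
        return []


# Hypothetical tool names, for illustration only.
monitor = ToolCallMonitor(allowed={"search_docs", "fetch_page"})
alerts: List[str] = []
for _ in range(10):
    alerts.extend(monitor.record("session-a", "fetch_page"))
alerts.extend(monitor.record("session-a", "run_shell"))
print(alerts)
```

Alerting only when the count equals the threshold keeps the log quiet for long sessions; a stricter policy could block further calls instead.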

Agent Features

Memory

short-term context (conversation and recent actions)

extended context across many steps required for blind SQL union

Planning

assistant-API planning (iterative, backtracking)

multi-step exploit planning

Tool Use

function calling, headless browser automation (Playwright), terminal (curl), Python code execution

Frameworks

LangChain, OpenAI Assistants API, Together AI API, Playwright

Is Agentic

Yes

Architectures

GPT-4, GPT-3.5, various open-source LLMs

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

The authors withhold the detailed prompts, the exact documents, and the code for safety, which makes exact reproduction difficult.

Experiments were run on sandboxed websites, which may not capture the full diversity of live web infrastructure.

When Not To Use

As a blueprint to build offensive tools or automate penetration testing without explicit legal authorization.

To claim general real-world exploit prevalence from the limited real-site sample.

Failure Modes

Agent gets stuck on an initial unsuccessful strategy and fails to explore alternatives without the right prompt.

Open-source models often fail due to incorrect tool use or poor multi-turn planning.

Core Entities

Models

GPT-4, GPT-3.5, OpenHermes-2.5-Mistral-7B, LLaMA-2 Chat 70B, LLaMA-2 Chat 13B, LLaMA-2 Chat 7B, Mixtral-8x7B, Mistral-7B, Nous Hermes-2 Yi 34B, OpenChat 3.5

Metrics

pass@5, overall success rate, avg function calls per successful hack, token cost per run

Datasets

web-vuln-15-sandbox

Benchmarks

15-web-vulnerabilities-sandbox
