GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

February 6, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

Clear experimental evidence shows GPT-4 can carry out many web exploits in a sandbox; results are limited by withheld prompts/docs and a modest real-site sample.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 20%

Novelty: 60%

Authors

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang

Links

Abstract / PDF

Why It Matters For Business

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Who Should Care

Summary TLDR

The authors wrap LLMs in an agent that reads documents, calls functions (browser, terminal, Python), and plans multi-step attacks. GPT-4, given document reading and a detailed prompt, autonomously exploited 11 of 15 sandboxed vulnerabilities (pass@5 73.3%, overall success rate 42.7%). GPT-3.5 rarely succeeds (6.7% pass@5), and every open-source model tested scored 0%. Ablations show that document access and a strong system prompt are critical. A small real-world scan found one XSS vulnerability. The authors withhold the exact prompts, the documents, and most of the code for safety.

Problem Statement

Can modern LLM agents autonomously discover and exploit website vulnerabilities without human feedback? The paper tests whether tool-capable LLMs can plan multi-step hacks, use website feedback, and scale to real sites while measuring success rates and costs.

Main Contribution

Demonstrates an LLM-agent pipeline that autonomously performs web attacks using function calls, document reading, and planning.

Benchmarks 10 LLMs on 15 sandboxed web vulnerabilities and reports pass@5 and overall success rates.

Key Findings

GPT-4 agent succeeded on most sandboxed vulnerabilities

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)

Practical Use: Frontier LLMs can automate multi-step web exploits; security teams should assume such automation is feasible and test defenses accordingly.

Evidence Ref: Table 2; Section 4.2

Open-source LLMs failed on the benchmark

Numbers: 0% pass@5 for all tested open-source models (Table 2)

Practical Use: Current open-source models lack reliable tool use and multi-step planning for autonomous hacks, but targeted tuning could close this gap.

Evidence Ref: Table 2; Section 4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-4 pass@5 | 73.3% | - | - | 15 sandbox vulnerabilities, 5 trials each | GPT-4 assistant achieved pass@5 of 73.3% across the benchmark (at least one success in 5 trials on 11 of 15 vulnerabilities) | Table 2; Section 4.2
GPT-4 overall success rate | 42.7% | - | - | 15 sandbox vulnerabilities, 5 trials each | Overall success (share of all individual trials that succeeded) was 42.7% | Table 2; Section 4.2
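The two metrics in the table are easy to confuse, so a minimal sketch of how they differ may help. It assumes pass@5 counts a vulnerability as solved if any of its 5 trials succeeds, and that the overall success rate is the share of all individual trials that succeed. The per-vulnerability counts below are hypothetical, chosen only so that the aggregates match the reported 73.3% and 42.7%; they are not the paper's per-vulnerability results.

```python
from typing import Sequence


def pass_at_k(successes_per_task: Sequence[int]) -> float:
    """Fraction of tasks with at least one successful trial."""
    solved = sum(1 for s in successes_per_task if s > 0)
    return solved / len(successes_per_task)


def overall_success_rate(successes_per_task: Sequence[int], trials_per_task: int) -> float:
    """Fraction of all individual trials that succeeded."""
    total_trials = len(successes_per_task) * trials_per_task
    return sum(successes_per_task) / total_trials


# HYPOTHETICAL successes out of 5 trials for each of 15 vulnerabilities.
# Invented for illustration: 11 tasks solved at least once, 32 of 75 trials succeed.
hypothetical_counts = [5, 5, 4, 3, 3, 3, 2, 2, 2, 2, 1, 0, 0, 0, 0]
TRIALS_PER_TASK = 5

print(f"pass@5  = {pass_at_k(hypothetical_counts):.1%}")
print(f"overall = {overall_success_rate(hypothetical_counts, TRIALS_PER_TASK):.1%}")
```

Under these assumptions, 11 of 15 tasks gives 73.3% and 32 of 75 trials gives 42.7%, which shows why pass@5 can sit well above the per-trial rate.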

What To Try In 7 Days

Audit and restrict API function-calling endpoints and permissions.

Limit models' access to raw web documents or unfiltered internet content.

Instrument and log multi-step browser automation calls, and flag long action sequences (>= 10 calls) that may indicate a multi-call exploit.
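The logging item above can be prototyped in a few lines. The following is a minimal sketch, assuming tool calls are already logged as (session, tool) records; the `ToolCall` record shape, the session identifiers, and the function name are hypothetical, and the threshold of 10 simply mirrors the ">= 10 calls" figure in the list.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class ToolCall(NamedTuple):
    """Hypothetical log record: one automated tool call within a session."""
    session_id: str
    tool: str  # e.g. "browser", "terminal", "python"


CALL_THRESHOLD = 10  # mirrors the ">= 10 calls" heuristic; tune for your traffic


def flag_long_sessions(calls: Iterable[ToolCall], threshold: int = CALL_THRESHOLD) -> Dict[str, int]:
    """Return sessions whose total call count meets or exceeds the threshold."""
    counts = Counter(call.session_id for call in calls)
    return {sid: n for sid, n in counts.items() if n >= threshold}


# Made-up log: session s1 issues 12 calls, session s2 issues 3.
log = [ToolCall("s1", "browser")] * 12 + [ToolCall("s2", "terminal")] * 3
print(flag_long_sessions(log))  # {'s1': 12}
```

A raw call count is a coarse signal and will also flag benign automation, so in practice it would likely be combined with other features such as target diversity or error rates.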

Agent Features

Memory: short-term context (conversation and recent actions); extended context across many steps required for blind SQL union

Planning: assistant-API planning (iterative, backtracking); multi-step exploit planning

Tool Use: function calling; headless browser automation (Playwright); terminal (curl); Python code execution

Frameworks: LangChain; OpenAI Assistants API; Together AI API; Playwright

Is Agentic: Yes

Architectures: GPT-4; GPT-3.5; various open-source LLMs
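From a defender's point of view, the function-calling pattern listed under Tool Use comes down to a dispatcher that maps a model-requested tool name to a function. The sketch below is not the authors' agent (their code is withheld); it is a hypothetical registry with stub tools, showing one way a deployment might enforce an allowlist and keep an audit trail. All names here (`ToolRegistry`, `fetch_page`, `run_shell`) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


class ToolNotAllowed(Exception):
    """Raised when a requested tool is not on the allowlist."""


class ToolRegistry:
    """Hypothetical dispatcher: allowlist check plus audit log for every request."""

    def __init__(self, allowed: Set[str]) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self._allowed = set(allowed)
        self.audit_log: List[Tuple[str, dict, bool]] = []

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        permitted = name in self._allowed and name in self._tools
        # Log every request, including denied ones, before acting on it.
        self.audit_log.append((name, kwargs, permitted))
        if not permitted:
            raise ToolNotAllowed(name)
        return self._tools[name](**kwargs)


def fetch_page(url: str) -> str:
    """Stub tool: returns a canned string instead of making a request."""
    return f"<stub response for {url}>"


def run_shell(command: str) -> str:
    """Stub tool: returns a canned string instead of executing anything."""
    return f"<stub output for {command}>"


registry = ToolRegistry(allowed={"fetch_page"})
registry.register("fetch_page", fetch_page)
registry.register("run_shell", run_shell)

print(registry.call("fetch_page", url="http://localhost:8000/"))
try:
    registry.call("run_shell", command="id")
except ToolNotAllowed as exc:
    print(f"blocked: {exc}")
```

Putting the permission check and the log in the same place means denied requests are recorded too, which gives call-sequence monitoring the data it needs.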

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The authors withhold the detailed prompts, the exact documents, and the code for safety, so the results are hard to reproduce exactly.

Experiments run on sandboxed websites, which may not capture the full diversity of live web infrastructure.

When Not To Use

As a blueprint to build offensive tools or automate penetration testing without explicit legal authorization.

To claim general real-world exploit prevalence from the limited real-site sample.

Failure Modes

Agent gets stuck on an initial unsuccessful strategy and fails to explore alternatives without the right prompt.

Open-source models often fail due to incorrect tool use or poor multi-turn planning.

Core Entities

Models

GPT-4, GPT-3.5, OpenHermes-2.5-Mistral-7B, LLaMA-2 Chat 70B, LLaMA-2 Chat 13B, LLaMA-2 Chat 7B, Mixtral-8x7B, Mistral-7B, Nous Hermes-2 Yi 34B, OpenChat 3.5

Metrics

pass@5, overall success rate, avg function calls per successful hack, token cost per run

Datasets

web-vuln-15-sandbox

Benchmarks

15-web-vulnerabilities-sandbox