Overview
Clear experimental evidence shows GPT-4 can carry out many web exploits in a sandbox; results are limited by withheld prompts/docs and a modest real-site sample.
Citations: 8
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 20%
Novelty: 60%
Why It Matters For Business
High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.
Summary TLDR
The authors wrap LLMs in an agent that reads documents, calls functions (browser, terminal, Python), and plans multi-step attacks. With document reading and a detailed system prompt, GPT-4 autonomously exploited 11 of 15 sandboxed vulnerabilities (pass@5 73.3%, overall success rate 42.7%). GPT-3.5 barely succeeds (6.7% pass@5), and every open-source model tested had a 0% success rate. Ablations show that document access and a strong system prompt are both critical. A small real-world scan found one XSS vulnerability. The authors withhold the exact prompts, documents, and most of the code for safety.
Problem Statement
Can modern LLM agents autonomously discover and exploit website vulnerabilities without human feedback? The paper tests whether tool-capable LLMs can plan multi-step hacks, use website feedback, and scale to real sites while measuring success rates and costs.
Main Contribution
Demonstrates an LLM-agent pipeline that autonomously performs web attacks using function calls, document reading, and planning.
Benchmarks 10 LLMs on 15 sandboxed web vulnerabilities and reports pass@5 and overall success rates.
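The function-calling pattern named above can be illustrated with a generic tool-dispatch loop. This is a minimal sketch, not the authors' pipeline (their prompts and code are withheld): the `Agent` class, the stub tools `read_document` and `fetch_page`, and the `scripted_planner` that stands in for the LLM are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (tool name, single string argument)


@dataclass
class Agent:
    """Generic plan/act/observe loop; all names here are hypothetical."""
    tools: Dict[str, Callable[[str], str]]
    planner: Callable[[List[str]], Optional[Action]]
    history: List[str] = field(default_factory=list)

    def run(self, max_steps: int = 10) -> List[str]:
        for _ in range(max_steps):
            action = self.planner(self.history)  # an LLM call in a real agent
            if action is None:                   # planner signals it is done
                break
            name, arg = action
            tool = self.tools.get(name)
            observation = tool(arg) if tool else f"unknown tool: {name}"
            # The observation is fed back so the next plan can use it.
            self.history.append(f"{name}({arg!r}) -> {observation}")
        return self.history


def scripted_planner(history: List[str]) -> Optional[Action]:
    """Stand-in for the model: replays a fixed, harmless script."""
    script = [("read_document", "notes"), ("fetch_page", "/home")]
    return script[len(history)] if len(history) < len(script) else None


# Stub tools that return canned strings instead of touching a network.
stub_tools = {
    "read_document": lambda name: f"contents of {name}",
    "fetch_page": lambda path: f"html for {path}",
}

agent = Agent(tools=stub_tools, planner=scripted_planner)
trace = agent.run()
```

The loop caps the number of steps and records every call with its observation, which is also the data a defender would want to log.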
Key Findings
GPT-4 agent succeeded on most sandboxed vulnerabilities
Open-source LLMs failed on the benchmark
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 pass@5 | 73.3% | — | — | 15 sandbox vulnerabilities, 5 trials each | GPT-4 assistant achieved Pass@5 73.3% across the benchmark | Table 2; Section 4.2 |
| GPT-4 overall success rate | 42.7% | — | — | 15 sandbox vulnerabilities, 5 trials each | Overall success rate (successful trials as a share of all trials) 42.7% | Table 2; Section 4.2 |
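The two metrics in the table measure different things, and a short calculation makes the difference concrete. The sketch below assumes the usual definitions: pass@5 is the share of vulnerabilities with at least one success in five trials, and the overall success rate is the share of all trials that succeeded. The per-vulnerability outcomes are synthetic, chosen only to match the reported totals (11 of 15 vulnerabilities solved; 32 of 75 trials is the only count that rounds to 42.7%). They are not the paper's per-vulnerability data.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Share of vulnerabilities with at least one successful trial."""
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def overall_success_rate(trials: Dict[str, List[bool]]) -> float:
    """Share of all individual trials that succeeded."""
    total = sum(len(outcomes) for outcomes in trials.values())
    wins = sum(sum(outcomes) for outcomes in trials.values())
    return wins / total


# Synthetic outcomes (assumption): 10 vulnerabilities with 3/5 successes,
# 1 with 2/5, and 4 with 0/5, giving 32 successes over 75 trials.
synthetic: Dict[str, List[bool]] = {}
for i in range(10):
    synthetic[f"vuln_{i}"] = [True, True, True, False, False]
synthetic["vuln_10"] = [True, True, False, False, False]
for i in range(11, 15):
    synthetic[f"vuln_{i}"] = [False] * 5

pass5_pct = round(pass_at_k(synthetic) * 100, 1)
overall_pct = round(overall_success_rate(synthetic) * 100, 1)
```

Because a single success in five tries counts toward pass@5, that figure is always at least as high as the overall success rate.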
What To Try In 7 Days
Audit and restrict API function-calling endpoints and permissions.
Limit models' access to raw web documents or unfiltered internet content.
Instrument and log multi-step browser automation calls, and flag long action sequences (10 or more calls) for anomaly detection of multi-call exploits.
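The first and third items above can be prototyped together as a thin gate in front of tool calls. This is a sketch under stated assumptions: `ToolGate`, its session IDs, and the tool names are hypothetical, and the threshold of 10 calls simply mirrors the figure in the list rather than a tuned value.

```python
import logging
from collections import defaultdict
from typing import DefaultDict, Iterable, Set

logger = logging.getLogger("tool_gate")


class ToolGate:
    """Allowlist tool calls and flag sessions with long call sequences."""

    def __init__(self, allowed: Iterable[str], max_calls: int = 10) -> None:
        self.allowed: Set[str] = set(allowed)
        self.max_calls = max_calls
        self.calls: DefaultDict[str, int] = defaultdict(int)
        self.flagged: Set[str] = set()

    def check(self, session_id: str, tool: str) -> bool:
        """Return True if the call may proceed; log every decision."""
        if tool not in self.allowed:
            logger.warning("denied tool=%s session=%s", tool, session_id)
            return False
        self.calls[session_id] += 1
        count = self.calls[session_id]
        logger.info("allowed tool=%s session=%s call=%d", tool, session_id, count)
        if count >= self.max_calls:
            # Flag for review rather than block; blocking is a policy choice.
            self.flagged.add(session_id)
            logger.warning("long sequence session=%s calls=%d", session_id, count)
        return True


gate = ToolGate(allowed={"search", "fetch_page"}, max_calls=10)
results = [gate.check("session-a", "fetch_page") for _ in range(10)]
denied = gate.check("session-a", "terminal")
```

Flagging instead of blocking keeps legitimate long workflows running while still surfacing them for review.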
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Risks & Boundaries
Limitations
The authors withhold the detailed prompts, the exact documents, and the code for safety, so the results are hard to reproduce exactly.
The experiments run on sandboxed websites, which may not capture the full diversity of live web infrastructure.
When Not To Use
As a blueprint to build offensive tools or automate penetration testing without explicit legal authorization.
To claim general real-world exploit prevalence from the limited real-site sample.
Failure Modes
Agent gets stuck on an initial unsuccessful strategy and fails to explore alternatives without the right prompt.
Open-source models often fail due to incorrect tool use or poor multi-turn planning.

