Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slasH‑

Overview

Decision Snapshot: Ready For Pilot

The approach is practical and low-cost; evidence comes from multiple datasets and ablations, but results depend on LLM quality and benchmark sanitization.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO

Summary TLDR

The paper shows a lightweight, non-agentic pipeline (AGENTLESS) for fixing real GitHub issues: hierarchical localization (file → function → edit), LLM-based patch sampling in a small diff format, and LLM-generated reproduction tests plus regression testing for selection. On SWE-bench Lite (300 problems) AGENTLESS fixes 96 issues (32.00%) at an average cost of $0.70, outperforming prior open-source agentic tools while being simpler and cheaper. The authors also hand-audit SWE-bench Lite, remove problematic cases, and publish a filtered set (SWE-bench LiteS).

Problem Statement

Current LLM agent frameworks are complex, costly, and fragile. The paper asks: can a simple, non-agentic pipeline (no autonomous tool use or multi-turn planning) match or beat agent-based approaches on real repo-level coding tasks?

Main Contribution

AGENTLESS: a three-phase agentless pipeline (hierarchical localization, patch sampling with simple diff edits, and validation via reproduction + regression tests).

Empirical evaluation on SWE-bench Lite showing 32.00% resolved (96/300) at $0.70 average cost, competitive with or better than open-source agents.
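
The fixed three-phase flow described above can be sketched as a plain function with no planning loop. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `Candidate`, and the three callables passed in are hypothetical names, and the stubs in the usage lines stand in for real LLM calls and test runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    patch: str          # one sampled edit (hypothetical representation)
    passes_tests: bool  # outcome of the validation phase


def run_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    sample_patches: Callable[[str, List[str]], List[str]],
    validate: Callable[[str], bool],
) -> List[Candidate]:
    """Run the phases in a fixed order; the model never chooses the next step."""
    locations = localize(issue)                   # phase 1: localization
    patches = sample_patches(issue, locations)    # phase 2: patch sampling
    return [Candidate(p, validate(p)) for p in patches]  # phase 3: validation


# Usage with stub phases (assumptions standing in for LLM and test calls).
results = run_pipeline(
    "parse() crashes on whitespace",
    localize=lambda issue: ["pkg/mod.py::parse"],
    sample_patches=lambda issue, locs: [f"patch-{i} for {locs[0]}" for i in range(3)],
    validate=lambda patch: patch.startswith("patch-1"),
)
print([c.passes_tests for c in results])
```

Because each phase is an ordinary function, any one of them can be swapped or tested in isolation, which is the engineering simplicity the summary attributes to the non-agentic design.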

Key Findings

AGENTLESS resolves 96 of 300 SWE-bench Lite problems

Numbers: 96/300 = 32.00%

Practical Use: A compact pipeline can match or beat open-source agent tools; try a non-agentic approach first to save cost and engineering time.

Evidence Ref: Table 1; Sec. 5.1

Average inference cost per issue is low

Numbers: Avg. $ Cost = $0.70

Practical Use: You can run large-scale repo-level experiments cheaply; prefer targeted small edits and sampling over expensive multi-turn agents.

Evidence Ref: Table 1; Sec. 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
%Resolved	32.00% (96/300)	—	—	SWE-bench Lite	Table 1; Sec. 5.1	Table 1
Avg. $ Cost	$0.70	—	—	SWE-bench Lite	Table 1; Sec. 4	Table 1
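
The two reported numbers can be combined into a rough budget estimate. The sketch below assumes the $0.70 figure is a mean over all 300 problems; the total and the cost per resolved issue are derived here for illustration and are not figures reported in the source.

```python
resolved, total = 96, 300   # from the results table
avg_cost = 0.70             # average dollars per issue, from the results table

resolve_rate = resolved / total          # fraction of issues fixed
total_cost = avg_cost * total            # derived: full benchmark run
cost_per_fix = total_cost / resolved     # derived: dollars per resolved issue

print(f"{resolve_rate:.2%}")    # 32.00%
print(f"${total_cost:.2f}")     # $210.00
print(f"${cost_per_fix:.2f}")   # $2.19
```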

What To Try In 7 Days

Run AGENTLESS-style pipeline on a small set of repo issues: localize → sample diff patches → validate with regression + generated tests.

Add a lightweight embedding retrieval step (chunk embeddings via OpenAI) and a file-skeleton prompt to reduce LLM context size.

Audit your in-house bug reports for exact-patch leaks or missing info; filter them before model evaluation.
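
The audit step above can start with a crude automated check before any manual review. The sketch below flags issues whose text already contains lines that the reference fix adds. The helper names, the 10-character minimum, and the unified-diff input are assumptions made for illustration; the source says the authors audited the benchmark by hand.

```python
from typing import List


def added_lines(patch: str) -> List[str]:
    """Return non-trivial lines that a unified diff adds ('+' but not '+++')."""
    out = []
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            text = line[1:].strip()
            if len(text) >= 10:  # skip braces, blank lines, and similar noise
                out.append(text)
    return out


def leak_ratio(issue_text: str, patch: str) -> float:
    """Fraction of added patch lines that appear verbatim in the issue text."""
    lines = added_lines(patch)
    if not lines:
        return 0.0
    normalized = " ".join(issue_text.split())  # collapse whitespace
    hits = sum(1 for l in lines if " ".join(l.split()) in normalized)
    return hits / len(lines)


patch = """--- a/p.py
+++ b/p.py
@@ -1,2 +1,2 @@
-    return int(value)
+    return int(value.strip())
"""
leaky = "Crash in parse(). Changing it to `return int(value.strip())` fixes it."
clean = "Crash in parse() when the input has trailing whitespace."
print(leak_ratio(leaky, patch), leak_ratio(clean, patch))
```

A high ratio only marks a report for review; paraphrased or partially quoted fixes would slip past a verbatim match like this one.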

Agent Features

Memory

no long-term retrieval memory

Tool Use

no autonomous tool execution

no multi-turn action planning

Frameworks

LlamaIndex

OpenAI APIs

Architectures

prompting + embedding retrieval

hierarchical localization (file → skeleton → edit)
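
The skeleton step in the localization hierarchy can be approximated with Python's standard `ast` module. This is a sketch under stated assumptions: the `skeleton` function is hypothetical, it keeps only the first line of each class and function header, and it is not claimed to match the exact skeleton format used by the authors.

```python
import ast

HEADER_NODES = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def skeleton(source: str) -> str:
    """Return the class/def header lines of a module, in source order."""
    lines = source.splitlines()
    nodes = [n for n in ast.walk(ast.parse(source)) if isinstance(n, HEADER_NODES)]
    nodes.sort(key=lambda n: n.lineno)  # ast.walk does not guarantee source order
    return "\n".join(lines[n.lineno - 1].rstrip() for n in nodes)


example = '''
class Cache:
    def get(self, key):
        value = self._data.get(key)
        return value

def load(path):
    with open(path) as f:
        return f.read()
'''
print(skeleton(example))
```

Dropping function bodies this way shrinks the prompt while keeping the names the model needs to choose where to edit; multi-line signatures would be cut to their first line in this simplified version.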

Optimization Features

Token Efficiency

skeleton format to reduce context size

search/replace diffs to avoid re-generating full files
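
Applying a search/replace edit takes only a few lines, which is part of why the format is cheap: the model emits the changed snippet rather than the whole file. In the sketch below, `apply_search_replace` is a hypothetical helper, and the rule that the search block must match exactly once is an assumption added for safety, not a detail taken from the source.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Swap one exact occurrence of `search` for `replace`."""
    count = source.count(search)
    if count != 1:
        # Zero matches means a stale or hallucinated snippet;
        # several matches means the edit location is ambiguous.
        raise ValueError(f"search block must match exactly once, found {count}")
    return source.replace(search, replace, 1)


original = "def parse(value):\n    return int(value)\n"
patched = apply_search_replace(
    original,
    search="    return int(value)\n",
    replace="    return int(value.strip())\n",
)
print(patched)
```

Rejecting ambiguous or missing matches turns a malformed model output into an early, cheap failure instead of a silently wrong file.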

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/OpenAutoCoder/Agentless

Data URLs

https://www.swebench.com/lite.html

https://openai.com/index/introducing-swe-bench-verified/

Risks & Boundaries

Limitations

Performance drops on problems with no location clues; agentic tools with search tools still do better there.

Generated reproduction tests are imperfect: many reproduce the bug but fewer can validate fixes.

When Not To Use

When you need agents to run complex toolchains or perform multi-step environment interactions.

When issue descriptions lack any location hints and you need aggressive repository-wide search tools.

Failure Modes

LLM is distracted by long file contents if skeleton compression is not used.

Incorrect reproduction tests can bias patch selection if regression tests are weak.
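
The second failure mode is easier to see with a small selection routine. The sketch below filters candidates by regression tests, prefers those that also pass the generated reproduction test, and breaks ties by picking the most frequent patch. `select_patch` and its predicates are hypothetical, and the frequency-based tie-break is an assumption for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def select_patch(
    patches: List[str],
    passes_regression: Callable[[str], bool],
    passes_reproduction: Callable[[str], bool],
) -> Optional[str]:
    """Pick one patch, or None if every candidate breaks existing tests."""
    survivors = [p for p in patches if passes_regression(p)]
    if not survivors:
        return None
    reproduced = [p for p in survivors if passes_reproduction(p)]
    pool = reproduced or survivors  # fall back when no patch passes the new test
    return Counter(pool).most_common(1)[0][0]


patches = ["A", "B", "B", "C"]
regression_ok = lambda p: p != "C"

good = select_patch(patches, regression_ok, lambda p: p == "B")
# A faulty reproduction test that only "A" satisfies overrides the majority.
biased = select_patch(patches, regression_ok, lambda p: p == "A")
print(good, biased)
```

With a weak regression suite nothing else constrains the choice, so the faulty reproduction test alone decides the outcome, as the `biased` case shows.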

Core Entities

Models

GPT-4o (gpt-4o-2024-05-13)

text-embedding-3-small (OpenAI)

Metrics

% Resolved

Avg. $ Cost

Avg. # Tokens

% Correct Location (line/function/file)

Datasets

SWE-bench Lite (300 problems)

SWE-bench LiteS (249 filtered problems)

SWE-bench Verified (500 issues, OpenAI)

Benchmarks

SWE-bench Lite

SWE-bench LiteS

SWE-bench Verified
