Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slash-

July 1, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is practical and low-cost; evidence comes from multiple datasets and ablations, but results depend on LLM quality and benchmark sanitization.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang

Why It Matters For Business

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Summary TLDR

The paper presents AGENTLESS, a lightweight, non-agentic pipeline for fixing real GitHub issues: hierarchical localization (file → function → edit), LLM-based patch sampling in a small diff format, and patch selection using LLM-generated reproduction tests plus regression tests. On SWE-bench Lite (300 problems), AGENTLESS fixes 96 issues (32.00%) at an average cost of $0.70 per issue, outperforming prior open-source agentic tools while being simpler and cheaper. The authors also hand-audit SWE-bench Lite, remove problematic cases, and publish a filtered set (SWE-bench Lite-S).
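The three phases described above can be pictured as a single pass with no agent loop. The following is a minimal sketch, not the authors' implementation: `run_pipeline`, `localize`, `sample_patches`, and `validate` are hypothetical names, and the three callables stand in for the LLM prompts and test runs.

```python
from typing import Callable, List, Optional


def run_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    sample_patches: Callable[[str, List[str]], List[str]],
    validate: Callable[[str], bool],
) -> Optional[str]:
    """Run localize -> repair -> validate once, with no multi-turn planning."""
    locations = localize(issue)                      # phase 1: where to edit
    patches = sample_patches(issue, locations)       # phase 2: candidate diff edits
    survivors = [p for p in patches if validate(p)]  # phase 3: test-based filtering
    # Simplification for this sketch: return the first surviving patch.
    return survivors[0] if survivors else None


# Usage with stub callables (all hypothetical):
chosen = run_pipeline(
    "TypeError in parser",
    localize=lambda issue: ["parser.py"],
    sample_patches=lambda issue, locs: ["patch-a", "patch-b"],
    validate=lambda patch: patch == "patch-b",
)
print(chosen)
```

Because each phase is a plain function call, every step can be logged, cached, or swapped without touching the others.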

Problem Statement

Current LLM agent frameworks are complex, costly, and fragile. The paper asks: can a simple, non-agentic pipeline (no autonomous tool use or multi-turn planning) match or beat agent-based approaches on real repo-level coding tasks?

Main Contribution

AGENTLESS: a three-phase agentless pipeline (hierarchical localization, patch sampling with simple diff edits, and validation via reproduction + regression tests).

Empirical evaluation on SWE-bench Lite showing 32.00% resolved (96/300) at $0.70 average cost, competitive with or better than open-source agents.
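The "simple diff edits" mentioned above can be illustrated with a search/replace edit applied to file contents. This is a sketch under assumptions: `apply_search_replace` is a hypothetical helper, and the strict exactly-one-match rule is a design choice of this example rather than something the summary specifies.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; reject missing or ambiguous matches."""
    if not search:
        raise ValueError("search block must not be empty")
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)


original = "def add(a, b):\n    return a - b\n"
patched = apply_search_replace(original, "return a - b", "return a + b")
print(patched)
```

Rejecting zero or multiple matches keeps a malformed model edit from silently changing the wrong location.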

Key Findings

AGENTLESS resolves 96 of 300 SWE-bench Lite problems

Numbers: 96/300 = 32.00%

Practical Use: A compact pipeline can match or beat open-source agent tools; try a non-agentic approach first to save cost and engineering time.

Evidence Ref: Table 1; Sec. 5.1

Average inference cost per issue is low

Numbers: Avg. $ cost = $0.70

Practical Use: You can run large-scale repo-level experiments cheaply; prefer targeted small edits and sampling over expensive multi-turn agents.

Evidence Ref: Table 1; Sec. 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
% Resolved | 32.00% (96/300) | - | - | SWE-bench Lite | Table 1; Sec. 5.1 | Table 1
Avg. $ Cost | $0.70 | - | - | SWE-bench Lite | Table 1; Sec. 4 | Table 1
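The two reported figures can be combined into a rough budget estimate. The total cost and the cost per resolved issue below are derived here from the reported numbers, assuming the $0.70 average applies to all 300 problems; they are not figures stated in the source.

```python
resolved = 96
total_problems = 300
avg_cost_usd = 0.70  # reported average cost per issue

resolve_rate = resolved / total_problems        # fraction of issues fixed
total_cost_usd = total_problems * avg_cost_usd  # derived: cost of one full run
cost_per_fix_usd = total_cost_usd / resolved    # derived: cost per resolved issue

print(f"{resolve_rate:.2%}")  # 32.00%
print(f"${total_cost_usd:.2f} total, ${cost_per_fix_usd:.2f} per resolved issue")
# $210.00 total, $2.19 per resolved issue
```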

What To Try In 7 Days

Run AGENTLESS-style pipeline on a small set of repo issues: localize → sample diff patches → validate with regression + generated tests.

Add a lightweight embedding retrieval step (chunk embeddings via OpenAI) and a file-skeleton prompt to reduce LLM context size.

Audit your in-house bug reports for exact-patch leaks or missing info; filter them before model evaluation.
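One way to start the audit step above is to flag reports whose text already contains the fix. This is a hedged sketch: `added_lines` and `leaks_exact_patch` are hypothetical helpers, the 8-character threshold is an arbitrary assumption, and a verbatim substring check will miss paraphrased or reformatted leaks.

```python
from typing import List


def added_lines(patch: str) -> List[str]:
    """Return substantive lines added by a unified diff (those starting with '+')."""
    lines = []
    for raw in patch.splitlines():
        if raw.startswith("+") and not raw.startswith("+++"):
            text = raw[1:].strip()
            if len(text) >= 8:  # assumed threshold: skip blank lines, lone brackets
                lines.append(text)
    return lines


def leaks_exact_patch(issue_text: str, patch: str) -> bool:
    """True when every substantive added line already appears verbatim in the issue."""
    added = added_lines(patch)
    return bool(added) and all(line in issue_text for line in added)


gold_patch = (
    "--- a/calc.py\n+++ b/calc.py\n@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n-    return a - b\n+    return a + b\n"
)
print(leaks_exact_patch("add() is wrong, it should be `return a + b`", gold_patch))
print(leaks_exact_patch("add() returns the wrong result", gold_patch))
```

Flagged reports can then be reviewed by hand before being excluded from an evaluation set.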

Agent Features

Memory
no long-term retrieval memory
Tool Use
no autonomous tool execution; no multi-turn action planning
Frameworks
LlamaIndex; OpenAI APIs
Architectures
prompting + embedding retrieval; hierarchical localization (file → skeleton → edit)
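The embedding-retrieval part of localization can be sketched as ranking files by their best-matching chunk. This example assumes the chunk and query vectors are already computed (for instance by an embedding model such as the one listed under Models); `cosine` and `top_files` are hypothetical helpers, and no embedding API is called here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_files(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, List[Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Rank files by the similarity of their best chunk to the query."""
    scores = {
        path: max(cosine(query_vec, vec) for vec in vecs)
        for path, vecs in chunk_vecs.items()
        if vecs
    }
    return sorted(scores, key=lambda path: scores[path], reverse=True)[:k]


# Toy 2-dimensional vectors, purely illustrative.
chunks = {"a.py": [[1.0, 0.0]], "b.py": [[0.0, 1.0]], "c.py": [[1.0, 1.0]]}
print(top_files([1.0, 0.0], chunks, k=2))  # ['a.py', 'c.py']
```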

Optimization Features

Token Efficiency
skeleton format to reduce context size; search/replace diffs to avoid re-generating full files
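A skeleton format of the kind mentioned above keeps declarations and drops bodies. The sketch below uses Python's standard `ast` module; `file_skeleton` is a hypothetical helper, and the exact skeleton layout used by the authors is not given in this summary.

```python
import ast
from typing import List


def file_skeleton(source: str) -> str:
    """Return class and function headers only, indented by nesting depth."""
    out: List[str] = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                out.append("    " * depth + f"class {child.name}:")
                visit(child, depth + 1)  # descend to list the class's methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Only plain positional parameters are shown in this sketch.
                args = ", ".join(a.arg for a in child.args.args)
                out.append("    " * depth + f"def {child.name}({args}): ...")
                # Function bodies (and any nested definitions) are dropped.

    visit(ast.parse(source), 0)
    return "\n".join(out)


code = (
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text.split()\n"
    "\n"
    "def main(argv):\n"
    "    print(Parser().parse(argv[0]))\n"
)
print(file_skeleton(code))
```

The skeleton for a large file is typically a small fraction of its length, which is what makes it useful as a localization prompt.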

Risks & Boundaries

Limitations

Performance drops on problems with no location clues; agentic tools with search tools still do better there.

Generated reproduction tests are imperfect: many reproduce the bug but fewer can validate fixes.

When Not To Use

When you need agents to run complex toolchains or perform multi-step environment interactions.

When issue descriptions lack any location hints and you need aggressive repository-wide search tools.

Failure Modes

LLM is distracted by long file contents if skeleton compression is not used.

Incorrect reproduction tests can bias patch selection if regression tests are weak.
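The interaction between the two test types during patch selection can be sketched as filter-then-vote. This is an illustration under assumptions, not the authors' exact procedure: `normalize` and `select_patch` are hypothetical, the whitespace-only normalization is deliberately crude, and the fallback to regression-only survivors is a choice made for this example.

```python
from collections import Counter
from typing import Callable, List, Optional


def normalize(patch: str) -> str:
    """Strip whitespace and blank lines so trivially different patches vote together."""
    return "\n".join(line.strip() for line in patch.splitlines() if line.strip())


def select_patch(
    patches: List[str],
    passes_regression: Callable[[str], bool],
    passes_reproduction: Callable[[str], bool],
) -> Optional[str]:
    """Filter by regression tests, then reproduction tests, then majority-vote."""
    kept = [p for p in patches if passes_regression(p)]
    if not kept:
        return None
    reproduced = [p for p in kept if passes_reproduction(p)]
    # Assumed fallback: if the generated test rejects everything, it may itself
    # be wrong, so vote over the regression survivors instead.
    pool = reproduced or kept
    winner, _ = Counter(normalize(p) for p in pool).most_common(1)[0]
    return next(p for p in pool if normalize(p) == winner)


candidates = ["x = 1", "x = 1 ", "x = 2", "x = 3"]
best = select_patch(
    candidates,
    passes_regression=lambda p: "3" not in p,
    passes_reproduction=lambda p: True,
)
print(best)
```

With weak regression tests, the reproduction filter decides most of the outcome, which is why an incorrect reproduction test can skew the vote.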

Core Entities

Models

GPT-4o (gpt-4o-2024-05-13); text-embedding-3-small (OpenAI)

Metrics

% Resolved; Avg. $ Cost; Avg. # Tokens; % Correct Location (line/function/file)

Datasets

SWE-bench Lite (300 problems); SWE-bench Lite-S (249 filtered problems); SWE-bench Verified (500 issues, OpenAI)

Benchmarks

SWE-bench Lite; SWE-bench Lite-S; SWE-bench Verified