Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slash…

July 1, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

13

Authors

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang

Why It Matters For Business

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Summary TLDR

The paper presents AGENTLESS, a lightweight, non-agentic pipeline for fixing real GitHub issues. It has three phases: hierarchical localization (file → function → edit), LLM-based patch sampling in a simple search/replace diff format, and patch selection using regression tests plus LLM-generated reproduction tests. On SWE-bench Lite (300 problems), AGENTLESS fixes 96 issues (32.00%) at an average cost of $0.70 per issue, outperforming prior open-source agentic tools while being simpler and cheaper. The authors also hand-audit SWE-bench Lite, remove problematic cases, and publish a filtered set (SWE-bench LiteS).
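To make the fixed, non-agentic control flow concrete, here is a minimal sketch of the three phases wired together. The function names (`run_pipeline`, `fake_localize`, `fake_sample`, `fake_validate`) and the `Candidate` type are hypothetical placeholders, not the authors' code; in a real setup the three callables would wrap LLM calls and a test runner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    patch: str                 # a candidate edit, kept as plain text here
    passed_tests: bool = False


def run_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    sample_patches: Callable[[str, List[str]], List[str]],
    validate: Callable[[str], bool],
) -> List[Candidate]:
    """Run the phases in a fixed order; the model never chooses the next action."""
    locations = localize(issue)                    # phase 1: where to edit
    patches = sample_patches(issue, locations)     # phase 2: candidate edits
    candidates = [Candidate(p, validate(p)) for p in patches]  # phase 3: test
    return [c for c in candidates if c.passed_tests]


# Hypothetical stand-ins for the LLM and the test runner.
def fake_localize(issue: str) -> List[str]:
    return ["pkg/mod.py::parse"]


def fake_sample(issue: str, locations: List[str]) -> List[str]:
    return ["patch-a", "patch-b"]


def fake_validate(patch: str) -> bool:
    return patch == "patch-b"


survivors = run_pipeline(
    "parse() crashes on empty input", fake_localize, fake_sample, fake_validate
)
print([c.patch for c in survivors])
```

The point of the sketch is the shape: each phase is a plain function call with a fixed order, so there is no tool-use loop or multi-turn planning to debug.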

Problem Statement

Current LLM agent frameworks are complex, costly, and fragile. The paper asks: can a simple, non-agentic pipeline (no autonomous tool use or multi-turn planning) match or beat agent-based approaches on real repo-level coding tasks?

Main Contribution

AGENTLESS: a three-phase, non-agentic pipeline (hierarchical localization, patch sampling as simple diff edits, and validation via reproduction and regression tests).

Empirical evaluation on SWE-bench Lite showing 32.00% resolved (96/300) at $0.70 average cost, competitive with or better than open-source agents.

Manual audit of SWE-bench Lite that finds problematic items (exact patches, missing info, misleading solutions) and a cleaned subset called SWE-bench LiteS.

Key Findings

AGENTLESS resolves 96 of 300 SWE-bench Lite problems

Numbers: 96/300 = 32.00%

Average inference cost per issue is low

Numbers: Avg. cost = $0.70

Generated reproduction tests often reproduce the issue but less often validate the fix

Numbers: 213 reproduction tests reproduced the issue; only 94 validated the ground-truth fix

SWE-bench Lite contains problematic cases that bias evaluation

Numbers: 4.3% exact patch in description; 10.0% missing info; 5.0% misleading solutions

Results

% Resolved

Value: 32.00% (96/300)

Avg. $ Cost

Value: $0.70

Avg. # Tokens

Value: 78,166 (input+output)

% Correct Location (file)

Value: 69.7%

Reproduction tests that reproduce issue

Value: 213 / 300

Reproduction tests that validate ground-truth fixes

Value: 94 / 300
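The figures above combine into a few derived rates. The short calculation below is only arithmetic on the reported numbers; "cost per resolved issue" is a derived quantity, not one reported here, and it assumes the $0.70 average is taken over all 300 attempted issues.

```python
TOTAL_ISSUES = 300
resolved = 96
avg_cost_per_issue = 0.70   # dollars, averaged over all attempted issues (assumption)
reproduced = 213
validated = 94

resolve_rate = resolved / TOTAL_ISSUES
reproduce_rate = reproduced / TOTAL_ISSUES
validate_rate = validated / TOTAL_ISSUES

# Total spend divided by the number of issues actually fixed.
cost_per_resolved = avg_cost_per_issue * TOTAL_ISSUES / resolved

print(f"resolved:          {resolve_rate:.2%}")
print(f"tests reproducing: {reproduce_rate:.2%}")
print(f"tests validating:  {validate_rate:.2%}")
print(f"cost per resolved: ${cost_per_resolved:.2f}")
```

Under that assumption, each successful fix costs roughly three times the per-issue average, because the spend on unresolved issues is carried by the resolved ones.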

What To Try In 7 Days

Run an AGENTLESS-style pipeline on a small set of repo issues: localize → sample diff patches → validate with regression + generated tests.

Add a lightweight embedding retrieval step (chunk embeddings via OpenAI) and a file-skeleton prompt to reduce LLM context size.

Audit your in-house bug reports for exact-patch leaks or missing info; filter them before model evaluation.
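For the audit step, one rough heuristic is to check how many lines added by the accepted fix already appear verbatim in the issue text. The sketch below is an illustration only, not the authors' audit procedure (which was done by hand); `added_lines`, `leak_ratio`, and the 8-character minimum are hypothetical choices.

```python
from typing import List


def added_lines(patch: str, min_len: int = 8) -> List[str]:
    """Return the non-trivial lines a unified diff adds (skipping '+++' headers)."""
    out = []
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            text = line[1:].strip()
            if len(text) >= min_len:   # ignore short lines that match by chance
                out.append(text)
    return out


def leak_ratio(issue_text: str, patch: str) -> float:
    """Fraction of added patch lines found verbatim in the issue description."""
    lines = added_lines(patch)
    if not lines:
        return 0.0
    normalized_issue = " ".join(issue_text.split())   # collapse whitespace
    hits = sum(1 for l in lines if " ".join(l.split()) in normalized_issue)
    return hits / len(lines)


patch = """--- a/util.py
+++ b/util.py
@@ -1,2 +1,4 @@
 def first(items, default_value=None):
+    if not items:
+        return default_value
     return items[0]
"""

leaky_issue = "Crash on empty list. Fix: add\n    if not items:\n        return default_value\n"
clean_issue = "first() raises IndexError when the list is empty."

print(leak_ratio(leaky_issue, patch))
print(leak_ratio(clean_issue, patch))
```

A high ratio is a flag for manual review rather than a verdict; paraphrased or partially quoted solutions would not be caught by exact matching.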

Agent Features

Memory

  • no long-term retrieval memory

Tool Use

  • no autonomous tool execution
  • no multi-turn action planning

Frameworks

  • LlamaIndex
  • OpenAI APIs

Architectures

  • prompting + embedding retrieval
  • hierarchical localization (file → class/function → edit location)
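A file skeleton for the middle localization step can be approximated with Python's standard `ast` module: keep class and function signatures, drop the bodies. This is a minimal sketch under that assumption, not the format the authors use; `skeleton` is a hypothetical helper and it ignores definitions nested inside functions or conditionals.

```python
import ast


def skeleton(source: str) -> str:
    """Return class/function signature lines only, with bodies removed."""
    tree = ast.parse(source)
    lines = []

    def visit(node: ast.AST, depth: int) -> None:
        pad = "    " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append(f"{pad}class {child.name}:")
                visit(child, depth + 1)          # descend to list methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in child.args.args)
                lines.append(f"{pad}def {child.name}({args}): ...")

    visit(tree, 0)
    return "\n".join(lines)


source = '''
import os

class Parser:
    def __init__(self, text):
        self.text = text

    def parse(self):
        return self.text.split()

def main(argv):
    print(Parser(" ".join(argv)).parse())
'''

print(skeleton(source))
```

The skeleton is much shorter than the file, so several candidate files can fit in one prompt when asking the model which classes or functions are relevant.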

Optimization Features

Token Efficiency

  • skeleton format to reduce context size
  • search/replace diffs to avoid re-generating full files
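A search/replace edit can be applied with plain string operations, which is why the model only needs to emit the changed snippet rather than the whole file. The sketch below is an assumption about how such an edit might be applied; `apply_search_replace` is a hypothetical helper, and the exactly-one-match rule is a safety choice made for this example.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace one exact occurrence of `search`; refuse ambiguous or missing matches."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search block must match exactly once, matched {count} times")
    return source.replace(search, replace, 1)


original = "def is_adult(age):\n    return age > 18\n"

patched = apply_search_replace(
    original,
    search="    return age > 18\n",
    replace="    return age >= 18\n",
)
print(patched)
```

Rejecting edits whose search text is missing or repeated gives a cheap syntactic filter before any tests are run.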

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Performance drops on problems with no location clues; agents equipped with search tools still do better there.
  • Generated reproduction tests are imperfect: many reproduce the bug but fewer can validate fixes.
  • Relies on a closed-source LLM (GPT-4o), so training-data leakage cannot be fully ruled out.

When Not To Use

  • When you need agents to run complex toolchains or perform multi-step environment interactions.
  • When issue descriptions lack any location hints and you need aggressive repository-wide search tools.

Failure Modes

  • LLM is distracted by long file contents if skeleton compression is not used.
  • Incorrect reproduction tests can bias patch selection if regression tests are weak.
  • Merging many sampled locations increases context and can confuse the model.
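The patch-selection failure mode above is easier to reason about with the selection logic written out. The sketch below filters candidates by regression results first and only then by the generated reproduction test, so a wrong reproduction test cannot override regressions; the final majority vote and the names `PatchResult` and `select_patch` are assumptions made for this illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatchResult:
    patch: str                  # normalized patch text
    regression_failures: int    # existing tests broken by the patch
    passes_reproduction: bool   # outcome of the LLM-generated reproduction test


def select_patch(results: List[PatchResult]) -> Optional[str]:
    """Pick one patch: fewest regressions, then reproduction test, then majority vote."""
    if not results:
        return None
    fewest = min(r.regression_failures for r in results)
    pool = [r for r in results if r.regression_failures == fewest]
    reproduced = [r for r in pool if r.passes_reproduction]
    if reproduced:
        # Keep the regression-only pool if no patch passes the generated test,
        # since that test may itself be wrong.
        pool = reproduced
    votes = Counter(r.patch for r in pool)
    return votes.most_common(1)[0][0]


results = [
    PatchResult("A", 0, True),
    PatchResult("B", 0, True),
    PatchResult("A", 0, True),
    PatchResult("C", 2, True),    # breaks existing tests, filtered out first
    PatchResult("D", 0, False),   # fails the reproduction test
]
print(select_patch(results))
```

If the regression suite is weak, the first filter removes little and the reproduction test dominates the outcome, which is the bias the failure mode describes.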

Core Entities

Models

  • GPT-4o (gpt-4o-2024-05-13)
  • text-embedding-3-small (OpenAI)

Metrics

  • % Resolved
  • Avg. $ Cost
  • Avg. # Tokens
  • % Correct Location (line/function/file)

Datasets

  • SWE-bench Lite (300 problems)
  • SWE-bench LiteS (249 filtered problems)
  • SWE-bench Verified (500 issues, OpenAI)

Benchmarks

  • SWE-bench Lite
  • SWE-bench LiteS
  • SWE-bench Verified