Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
13
Why It Matters For Business
A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.
Summary TLDR
The paper shows a lightweight, non-agentic pipeline (AGENTLESS) for fixing real GitHub issues: hierarchical localization (file → function → edit), LLM-based patch sampling in a small diff format, and LLM-generated reproduction tests plus regression testing for selection. On SWE-bench Lite (300 problems) AGENTLESS fixes 96 issues (32.00%) at an average cost of $0.70, outperforming prior open-source agentic tools while being simpler and cheaper. The authors also hand-audit SWE-bench Lite, remove problematic cases, and publish a filtered set (SWE-bench LiteS).
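To make the "small diff format" concrete, here is a minimal sketch of applying one search/replace edit to a file's text. The function name `apply_search_replace` and the rule of rejecting ambiguous matches are assumptions made for illustration; they are not taken from the paper's code.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit; require exactly one match (assumed rule)."""
    count = source.count(search)
    if count != 1:
        # Zero matches means the edit is stale; several means it is ambiguous.
        raise ValueError(f"expected exactly 1 match for search block, found {count}")
    return source.replace(search, replace, 1)


# Toy example: fix a wrong operator without regenerating the whole file.
original = "def area(w, h):\n    return w + h\n"
patched = apply_search_replace(
    original,
    search="    return w + h\n",
    replace="    return w * h\n",
)
print(patched)
```

The model only has to emit the changed snippet, which is why this format uses fewer output tokens than rewriting full files.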
Problem Statement
Current LLM agent frameworks are complex, costly, and fragile. The paper asks: can a simple, non-agentic pipeline (no autonomous tool use or multi-turn planning) match or beat agent-based approaches on real repo-level coding tasks?
Main Contribution
AGENTLESS: a three-phase agentless pipeline (hierarchical localization, patch sampling with simple diff edits, and validation via reproduction + regression tests).
Empirical evaluation on SWE-bench Lite showing 32.00% resolved (96/300) at $0.70 average cost, competitive with or better than open-source agents.
Manual audit of SWE-bench Lite that finds problematic items (exact patches, missing info, misleading solutions) and a cleaned subset called SWE-bench LiteS.
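The three phases can be sketched as a fixed, single-pass function with no agent loop. Everything here is a hypothetical stand-in: `run_pipeline` and the three injected callables are illustrative names, and the stubs replace what would be LLM calls and test runs in a real setup.

```python
from typing import Callable, Dict, List, Optional


def run_pipeline(
    issue: str,
    repo_files: Dict[str, str],
    localize: Callable[[str, Dict[str, str]], List[str]],
    sample_patches: Callable[[str, List[str]], List[str]],
    validate: Callable[[str], bool],
) -> Optional[str]:
    """Run localization, patch sampling, and validation once, in order."""
    locations = localize(issue, repo_files)        # phase 1: where to edit
    candidates = sample_patches(issue, locations)  # phase 2: candidate patches
    survivors = [p for p in candidates if validate(p)]  # phase 3: test-based filter
    return survivors[0] if survivors else None


# Stubs standing in for LLM prompts and test execution (assumptions).
issue = "Crash in utils.py when input is empty"
files = {"utils.py": "...", "main.py": "..."}
chosen = run_pipeline(
    issue,
    files,
    localize=lambda text, repo: [name for name in repo if name in text],
    sample_patches=lambda text, locs: [f"patch-{i} for {locs[0]}" for i in range(3)],
    validate=lambda patch: patch.startswith("patch-1"),
)
print(chosen)  # patch-1 for utils.py
```

The control flow is decided by the code, not by the model, which is the sense in which the approach is non-agentic.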
Key Findings
AGENTLESS resolves 96 of 300 SWE-bench Lite problems (32.00%)
Average inference cost per issue is low ($0.70)
Generated reproduction tests often reproduce the issue but rarely fully validate fixes
SWE-bench Lite contains problematic cases that bias evaluation
Results
%Resolved: 32.00% (96/300)
Avg. $ Cost: $0.70
Avg. # Tokens
%Correct Location (file)
Reproduction tests that reproduce issue
Reproduction tests that validate ground-truth fixes
Who Should Care
What To Try In 7 Days
Run AGENTLESS-style pipeline on a small set of repo issues: localize → sample diff patches → validate with regression + generated tests.
Add a lightweight embedding retrieval step (chunk embeddings via OpenAI) and a file-skeleton prompt to reduce LLM context size.
Audit your in-house bug reports for exact-patch leaks or missing info; filter them before model evaluation.
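For the audit step, a rough heuristic is to flag issues whose text already contains the lines that the reference fix adds. The sketch below assumes fixes are stored as unified diffs; `added_lines`, `leaks_exact_patch`, and the `threshold` parameter are hypothetical names, and this is not the manual procedure the authors used.

```python
from typing import List


def added_lines(diff_text: str) -> List[str]:
    """Collect non-empty added lines from a unified diff, skipping '+++' headers."""
    lines = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            stripped = line[1:].strip()
            if stripped:
                lines.append(stripped)
    return lines


def leaks_exact_patch(issue_text: str, diff_text: str, threshold: float = 1.0) -> bool:
    """Return True if at least `threshold` of the added lines appear in the issue."""
    added = added_lines(diff_text)
    if not added:
        return False
    normalized_issue = " ".join(issue_text.split())  # collapse whitespace
    hits = sum(1 for line in added if " ".join(line.split()) in normalized_issue)
    return hits / len(added) >= threshold


diff = """--- a/calc.py
+++ b/calc.py
@@ -1,2 +1,2 @@
 def area(w, h):
-    return w + h
+    return w * h
"""
leaky_issue = "area() is wrong. The fix is to use `return w * h` instead."
clean_issue = "area() returns the wrong value for a 2x3 rectangle."
print(leaks_exact_patch(leaky_issue, diff), leaks_exact_patch(clean_issue, diff))
```

A substring check like this only catches verbatim leaks; paraphrased solutions or missing information still need a human pass.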
Agent Features
Memory
- no long-term retrieval memory
Tool Use
- no autonomous tool execution
- no multi-turn action planning
Frameworks
- LlamaIndex
- OpenAI APIs
Architectures
- prompting + embedding retrieval
- hierarchical localization (file → function → edit), using file skeletons at the function step
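The embedding retrieval part can be sketched as ranking files by the best cosine similarity of any of their chunks to the issue text. The vectors below are toy values; in practice they would come from an embedding model such as text-embedding-3-small, and no API call is made here. The function names and the "best chunk per file" aggregation are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_files(
    query_vec: Sequence[float],
    chunk_vecs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Score each file by its best-matching chunk and return the top k paths."""
    best: Dict[str, float] = {}
    for path, vec in chunk_vecs:
        score = cosine(query_vec, vec)
        if score > best.get(path, float("-inf")):
            best[path] = score
    return sorted(best, key=best.get, reverse=True)[:k]


# Toy 2-D embeddings: one (file, chunk vector) pair per chunk.
chunks = [
    ("a.py", [1.0, 0.0]),
    ("a.py", [0.0, 1.0]),
    ("b.py", [0.6, 0.8]),
    ("c.py", [0.0, 1.0]),
]
print(top_k_files([1.0, 0.0], chunks, k=2))  # ['a.py', 'b.py']
```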
Optimization Features
Token Efficiency
- skeleton format to reduce context size
- search/replace diffs to avoid re-generating full files
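A skeleton view keeps only class and function headers so the model sees a file's structure without its bodies. The sketch below builds one with Python's standard `ast` module; the exact skeleton layout used by AGENTLESS may differ, and `file_skeleton` is a hypothetical helper that lists only plain positional parameters.

```python
import ast


def _signature(node, indent: str) -> str:
    """Render a one-line header for a function node (positional args only)."""
    keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    args = ", ".join(arg.arg for arg in node.args.args)
    return f"{indent}{keyword} {node.name}({args}): ..."


def file_skeleton(source: str) -> str:
    """Return top-level classes, their methods, and top-level functions as headers."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, function_types):
            lines.append(_signature(node, indent=""))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, function_types):
                    lines.append(_signature(item, indent="    "))
    return "\n".join(lines)


example = '''
import os

class Shape:
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h

def main():
    print(Shape(2, 3).area())
'''
print(file_skeleton(example))
```

Imports and function bodies are dropped, so the prompt grows with the number of definitions rather than with file length.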
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Performance drops on problems whose descriptions give no location clues; agents equipped with repository search tools still do better there.
- Generated reproduction tests are imperfect: many reproduce the bug but fewer can validate fixes.
- Relies on a closed-source LLM (GPT-4o), so training-data leakage cannot be fully ruled out.
When Not To Use
- When you need agents to run complex toolchains or perform multi-step environment interactions.
- When issue descriptions lack any location hints and you need aggressive repository-wide search tools.
Failure Modes
- LLM is distracted by long file contents if skeleton compression is not used.
- Incorrect reproduction tests can bias patch selection if regression tests are weak.
- Merging many sampled locations increases context and can confuse the model.
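The interaction between reproduction and regression tests in patch selection can be sketched as a two-stage filter followed by a tie-break. The candidate record layout, the fallback rules, and the majority vote over whitespace-normalized patches are assumptions of this sketch, not details stated in this summary.

```python
from collections import Counter
from typing import Dict, List, Optional


def select_patch(candidates: List[Dict]) -> Optional[str]:
    """Pick one patch: regression filter, then reproduction filter, then majority vote."""
    if not candidates:
        return None
    # Stage 1: keep patches that do not break existing tests (fall back to all).
    pool = [c for c in candidates if c["passes_regression"]] or candidates
    # Stage 2: prefer patches that make the generated reproduction test pass.
    pool = [c for c in pool if c["passes_reproduction"]] or pool

    def normalize(patch: str) -> str:
        return " ".join(patch.split())

    # Stage 3 (assumed tie-break): most frequent patch after whitespace normalization.
    counts = Counter(normalize(c["patch"]) for c in pool)
    winner = counts.most_common(1)[0][0]
    return next(c["patch"] for c in pool if normalize(c["patch"]) == winner)


samples = [
    {"patch": "x = 1", "passes_regression": True, "passes_reproduction": True},
    {"patch": "x  =  1", "passes_regression": True, "passes_reproduction": True},
    {"patch": "x = 2", "passes_regression": True, "passes_reproduction": True},
    {"patch": "x = 3", "passes_regression": False, "passes_reproduction": True},
]
print(select_patch(samples))  # x = 1
```

If the reproduction test is wrong, stage 2 can discard correct patches, and only a strong regression suite in stage 1 limits the damage.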
Core Entities
Models
- GPT-4o (gpt-4o-2024-05-13)
- text-embedding-3-small (OpenAI)
Metrics
- % Resolved
- Avg. $ Cost
- Avg. # Tokens
- % Correct Location (line/function/file)
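As a worked example of how these metrics are computed, the sketch below reproduces the reported 32.00% from 96 resolved out of 300. The helper names are hypothetical, the cost list is toy data, and counting a location as correct when the predicted set covers every ground-truth location is an assumed definition.

```python
from typing import List, Sequence


def percent_resolved(resolved: int, total: int) -> float:
    """Share of benchmark problems whose patch passes the hidden tests."""
    return 100.0 * resolved / total


def average_cost(costs: Sequence[float]) -> float:
    """Mean dollar cost per issue."""
    return sum(costs) / len(costs)


def percent_correct_location(
    predicted: List[Sequence[str]], gold: List[Sequence[str]]
) -> float:
    """Share of issues where predicted locations include all gold locations (assumed)."""
    hits = sum(1 for p, g in zip(predicted, gold) if set(g) <= set(p))
    return 100.0 * hits / len(gold)


print(f"{percent_resolved(96, 300):.2f}%")  # 32.00%
print(average_cost([0.5, 1.0]))             # toy per-issue costs, not paper data
print(percent_correct_location(
    predicted=[["a.py", "b.py"], ["c.py"]],
    gold=[["a.py"], ["d.py"]],
))
```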
Datasets
- SWE-bench Lite (300 problems)
- SWE-bench LiteS (249 filtered problems)
- SWE-bench Verified (500 issues, OpenAI)
Benchmarks
- SWE-bench Lite
- SWE-bench LiteS
- SWE-bench Verified

